1 Introduction

This  final  report  summarizes  the  work  done  on  the  project  entitled  “Compiler  Optimizations  for  Power  Aware 
Computing”  (COPAC).  COPAC  was  supported  by  DARPA  contract  #  AFRL  F30602-00-2-0564. 

The  project  started  on  April  28,  2000.  Highlights  of  our  findings  are  listed  in  the  following  sections. 

Enclosed  with  this  report  is  a  separate  final  report  for  a  subproject  started  in  February  2001  on  the  wearable 
motherboard. 


2  Target  and  Sample  2.6X  Reduction  in  Energy*Delay 

Before presenting a sample highlight result, we first briefly describe the base case.

2.1  Base  Architecture 

The typical embedded system consists of a processor, off-chip memory and Printed Circuit Board (PCB) buses
connecting the chips, as shown in Figure 1. We set up a simulation environment to measure each component shown
in Figure 1 accurately. For our processor model we used MARS, a cycle-accurate Verilog model of a 5-stage RISC
architecture obtained from the University of Michigan [24]. MARS can run ARM instructions. Compared to
high-level (e.g., instruction-level) power estimation methods, executing a Verilog model is more accurate but also
slower. We simulated the MARS model executing our applications using Synopsys VCS and measured power
using the Synopsys Power Compiler. To obtain an accurate transistor-level model of MARS, we synthesized MARS
using a TSMC 0.25µ library.


Figure  1:  Architecture 

To estimate memory power consumption, an analytical model based on D. Liu et al. [25] was used [4]. TSMC
0.25µ technology parameters and switching activity from VCS simulation were fed as input data. To estimate bus
power, bus lengths on an actual board (Skiff board from HP) were measured and the capacitance and power were
calculated [2, 6]. The compiler-based technique is implemented on Trimaran, a compiler framework that includes an
ARM instruction generator. The detailed power estimation procedure is explained in [4].

In short, the base case, as shown in Figure 1, consists of (i) a 100MHz ARM-like RISC processor whose layout
uses 700,125 transistors in TSMC 0.25µ technology, (ii) 1MB or 512KB of SRAM in a separate chip in TSMC
0.25µ technology running at either 100MHz or 50MHz, and (iii) a PCB with bus lines connecting the processor to
the SRAM memory chip (second level memory).

2.2 Sample 2.6X Reduction in Energy*Delay Using Data Remapping and Frequency/Voltage
Scaling of Memory

The energy optimization methodologies we invented can be categorized as software based approaches, hardware
based approaches and combined software/hardware based approaches. One software based approach is known as
“Data Remapping” and is a software (compiler) technique [3, 5, 7]. This approach remaps an application’s
data layout in memory so that data elements that are accessed contemporaneously are also placed together in
memory at contiguous addresses. In this way, data remapping reduces cache misses to the secondary memory,
since each load of a cache line (usually 4, 8 or more words from contiguous memory addresses) then brings in
data elements that are accessed together.

Figure 2: Energy distribution (DR+Freq/Volt Scale Mem., Health)

In this example we combined Data Remapping with a hardware-based approach, namely, voltage/frequency
scaling of memory. We lowered the supply voltage only for the off-chip memory, while the processor core kept the
same voltage as the base case [2, 6]. We also adopted a store buffer to offset the performance degradation due to
the half-speed off-chip memory. The end result of the combination of these techniques can be seen in Figure 2. All
four major consumers of energy shown on the left of Figure 2 for our base case of Figure 1 - the processor core, the
processor’s L1 cache, the off-chip (L2) memory and the off-chip buses - decrease when we apply Data Remapping
and Frequency/Voltage Scaling of Off-Chip Memory, as shown on the right of Figure 2. Overall, energy decreases
from 17.076 Joules to 9.274 Joules with a decrease in execution time as well: from 8.036 seconds to 5.78 seconds.
The overall result is a 60.94% or 2.6X reduction in energy*delay [8].
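
The energy*delay figure above follows directly from the four quoted numbers. The short Python sketch below is only an arithmetic check, not part of the project's tool flow; the function and variable names are illustrative.

```python
def energy_delay(energy_joules, delay_seconds):
    """Return the energy*delay product in joule-seconds."""
    return energy_joules * delay_seconds


# Figures quoted in the text for the base case and for the combined technique.
base = energy_delay(17.076, 8.036)
optimized = energy_delay(9.274, 5.78)

reduction_percent = 100.0 * (1.0 - optimized / base)
improvement_factor = base / optimized

print(f"reduction: {reduction_percent:.2f}%  factor: {improvement_factor:.2f}X")
```

The computed factor is slightly below the 2.6X quoted in the text, which appears to be rounded.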

3 Compiler/Architecture Savings

Figure 3 shows some representative highlights from the project. Figure 3 first shows some results from hardware
based approaches. Specifically, we highlight the idea we pioneered of lowering the voltage and frequency of
off-chip L2 memory (unfortunately, lowering the voltage necessitates lowering the frequency, since the achievable
circuit speed falls with reduced supply voltage). The best result was a 28.51% (1.4X) reduction in energy based on
this technique alone.

The middle section of Figure 3 highlights some results from software techniques. Unlike the previous hardware
technique, which decreases energy but unfortunately also increases execution time slightly, these software techniques
decrease execution time and, as a result, also decrease energy (power consumption remains unchanged since no
voltages vary at all) [3, 5, 7, 23]. While loop transformations did achieve up to a 50.63% (2X) reduction [11, 14], the
best result here is from Data Remapping, which achieved a 67.59% or 3.1X reduction in energy*delay [3, 5, 7, 23].

The  final  highlight  of  Figure  3  was  already  discussed  in  Section  2.2. 

One  additional  highlight  is  the  discovery  of  the  Energy-Delay  Ratio  paradigm[18,  19,  21,  22].  The  EDR  paradigm 
is  based  on  a  very  interesting  proof:  energy  is  minimized  in  a  VLSI  system  when  energy  consumption  is  proportional 
to  delay  for  each  logic  unit.  For  example,  if  an  adder  has  delay  10ns  and  a  multiplier  60ns,  then  energy  is  minimized 
when  the  multiplier  is  designed  such  that  it  consumes  6X  the  amount  of  energy  of  the  adder.  Using  the  EDR 
paradigm,  quick  power-aware  optimization  decisions  can  be  made  at  the  transistor-,  chip-  and  system-level. 

Figure 3: Resulting Reductions in Energy and Energy*Delay


Many more results are available by perusing the over 20 publications resulting from this project. These
publications are listed at the end of this report and are included in a CD provided with this report.

4  Additional  Funding 

4.1 HP Funding of Ozgur Celebican for Cycle-Accurate Energy Estimation of ARM
Based Embedded Systems

Ozgur Celebican, a member of the Georgia Institute of Technology DARPA team under the supervision of PI
Mooney, has been awarded funding by Hewlett-Packard for his graduate student stipend and tuition. Ozgur’s
project with HP is cycle-accurate energy estimation of ARM based embedded systems with I/O devices. Ozgur is
in the process of modifying an ARM performance simulator to include I/O power simulation capabilities. Some
of the research funded by DARPA is thus on a technology transition path for use in the design of commercial
products for Hewlett-Packard.

4.2  NSF  Funding  of  the  EDR  Paradigm 

Co-PI  Chatterjee  has  secured  approximately  $250,000  in  additional  funding  from  the  National  Science  Foundation 
to  continue  his  research  on  power  optimization  based  on  the  EDR  paradigm. 

4.3  Intel  Funding 

Intel  has  granted  some  funds  to  Prof.  Gao  of  the  University  of  Delaware;  Prof.  Gao  was  a  non-optional  subcontract 
to  this  project. 

5  Conclusion 

In conclusion, new technology approaches have been discovered in frequency/voltage scaling of second-level
memory and compiler optimizations including loop transformations and data remapping. The best overall result as
measured by energy*delay was a 3.1X reduction using data remapping. Also, a 2.6X reduction was achieved when
combining data remapping and frequency/voltage scaling of memory; this example also achieved the lowest power
(energy/time) consumption.

This  first  DARPA  project  for  the  PI  was  an  exciting  and  exhilarating  adventure  in  research  collaboration, 
graduate  student  exhortation  to  achieve  results  and  industry  cooperation  to  find  technology  transfer  paths  both 
for  those  results  as  well  as  for  the  trained  graduate  students  coming  out  of  the  project! 

ABSTRACT

In embedded systems, off-chip buses and memory (i.e., L2 memory
as opposed to the L1 memory which is usually on-chip cache)
consume significant power, often more than the processor itself. In
this paper, for the case of an embedded system with one processor
chip and one memory chip, we propose frequency and voltage
scaling of the off-chip buses and the memory chip and use a known
micro-architectural enhancement called a store buffer to reduce the
resulting impact on execution time. Our benchmarks show a system
(processor + off-chip bus + off-chip memory) power savings of
28% to 36% and an energy savings of 13% to 35%, all while increasing
the execution time in the range of 1% to 29%. Previous work
in power-aware computing has focused on frequency and voltage
scaling of the processors or selective power-down of sub-sets of
off-chip memory chips. This paper quantitatively explores
voltage/frequency scaling of off-chip buses and memory as a means
of trading off performance for power/energy at the system level in
embedded systems.

Keywords 

Voltage/Frequency Scaling, Power-Performance Trade-offs,
Embedded Systems, Design Space.

1.  INTRODUCTION 

A typical embedded system consists of at least three main components:
a processor (often with L1 cache), an off-chip memory
(called L2 memory) and an off-chip bus connecting the processor
and memory. The off-chip components, being highly capacitive,
may consume as much or more power than the processor. This suggests
that we can gain significant reductions in power and energy
by reducing the off-chip voltage and frequency. However, power
reduction from voltage (and corresponding frequency) reduction
could be compromised by an increase in execution time, thus resulting
in an overall increase in energy dissipation (note that the increase
in execution time is due to increased memory access time
and is a function of the cache misses). We demonstrate how the performance
impact of voltage and frequency scaling can be reduced
by implementing a known micro-architectural technique called a
store buffer. While our approach can apply to dynamic voltage scaling,
this paper only shows the tradeoffs between statically setting
the off-chip bus and memory voltage at 3.3 Volts and frequency at
100 MHz versus 2 Volts and 50 MHz. As is evident from the above
description, it is necessary to have an integrated framework in order
to quantitatively explore the power-performance design space
at the system level. Specifically, the contribution of this paper is as
follows.

We  have  combined  the  techniques  of  frequency/voltage  scaling 
of  off-chip  buses  and  memory  (circuit  level  technique)  with  a  store 
buffer  (architectural  technique)  to  realize  reductions  in  both  system 
power  and  system  energy  dissipation  with  a  negligible  impact  on 
the  execution  time. 

The  rest  of  the  paper  is  organized  as  follows:  Section  2  discusses 
the  motivation  for  this  work.  Section  3  gives  an  overview  of  the 
previous  work.  Section  4  describes  the  experimental  infrastructure. 
Section  5  discusses  the  methodology.  Section  6  presents  the  results 
and  Section  7  concludes  the  paper. 

2.  MOTIVATION 

In embedded systems, especially mobile applications, battery
life is a significant concern. It has been established that the battery
drains faster when power is drawn at a higher rate [1]. For example,
a battery which lasts for 1000 hours when drawing 10 milliamps
at 1.5 Volts will only last for 80 hours when drawing 100 milliamps
at the same voltage [1]. Also, an old battery discharges
faster than a new one. Currently, few hooks exist to trade off performance
for power to prolong the battery life of an embedded
system. For example, when the user is executing a time-critical
application like real-time video-conferencing, he might decide to
operate at peak performance and high power. Or he might opt for
low performance and very low power when executing low priority
applications like checking e-mail. Or he might choose an intermediate
power-performance point based on the existing battery capacity.
Note that turning off memory chips may not be possible because,
for example, video data might be stored in the memory and may
need to be used. Also note that, as presented in this paper, this
is a static technique and happens at the beginning of the program
execution. We currently do not address dynamic (run-time) scaling
of voltage/frequency of off-chip memory. This paper presents
an approach that can allow the compiler and/or the user to decide
at which power-performance point to operate, e.g., based on the
knowledge of battery capacity and battery discharge pattern.

Figure 1: Experimental Infrastructure


3. PREVIOUS WORK

There has been significant work in the field of voltage and frequency
scaling. Voltage scaling techniques have been investigated
at almost all levels of the design hierarchy, from the system level to
the device level, due to the quadratic effect on the switching power
dissipation. However, as the supply voltage becomes lower, the circuit
delay increases and the performance degrades [2]. Techniques
to improve performance fall into three main categories: reducing
the threshold voltage to improve circuit speeds, introducing parallelism
into the architecture while using slower device speeds, and
using multiple supply voltages to choose the lowest supply voltage
for different circuit components that still satisfies the speed requirements.
Our approach falls into the third category.

At the system/architecture level, a number of memory optimization
schemes for low power have been developed [3, 5]. Briefly,
those approaches can be categorized as follows: cache optimization,
memory access reduction (especially for off-chip memory),
memory sizing/structuring and memory-intensive voltage scaling.
Our work falls in the category of memory-intensive voltage scaling.

There has also been a lot of work on system level power analysis
for software power dissipation. In [6], profiling hardware is
used to identify tightly coupled regions of code and dynamically
optimize the configuration of the microprocessor so as to minimize
performance penalty. Our work is similar to this approach
in the sense that we set the performance parameters, namely, voltage
and frequency, based on the requirements of the application.
However, note that ours is a static technique rather than a dynamic
technique. Also, there has been work on dynamically adjusting
the speeds (i.e., either lowering the frequency or shutting down the unused
modules), whichever gives the best results [7]. At the technology
level, various techniques have been researched such as gating the
supply voltage to cache memories [14].

A similar framework to our power measurement infrastructure is
introduced in SimplePower [4]. SimplePower is based on a subset
of the SimpleScalar Instruction Set Architecture. Currently, SimplePower
does not capture the energy consumed by the control unit
of the processor nor the clock generation nor the distribution network
[13]. On the other hand, our work more accurately models
the specific RISC processor as the measurements are based on
cycle-accurate functional simulations and Register Transfer Level
hardware models used along with an actual technology library. An
additional aspect of our work is the inclusion of power models for
both off-chip memory [19] and the Printed Circuit Board bus.

Figure 2: Bus Power Dissipation Pattern (average power of bus line versus switching frequency in Hz)

4.  EXPERIMENTAL  INFRASTRUCTURE 

We consider an embedded system which consists of a classic
five stage pipeline RISC processor core with 4 kilobytes of instruction
cache, 4 kilobytes of data cache, a single off-chip synchronous
SRAM memory of size 0.5 Megabytes organized as 128K
x 4 bytes, and a bus interface consisting of a 32 bit address bus,
32 bit data bus and the read/write control signals between the processor
core and the SRAM memory.

Our experimental infrastructure (Figure 1) consists of four main
components: a C Compiler; MARS - obtained from the University
of Michigan - a cycle-accurate Verilog Model of a RISC processor
capable of running ARM instructions [12]; a power model for
off-chip buses; and a power model for memory.

We use the GNU-gcc ARM cross compiler version egcs-2.91.66.
For each benchmark we consider, we compile the benchmark to relocatable
ARM assembly code using the GNU-gcc ARM cross compiler.
Then we use the GNU cross-assembler to generate a binary
executable targeted towards ARM architectures. Then we translate
the binary into an ASCII format called VHX (Verilog HeX) [20]
which is suitable for being simulated on MARS using the Synopsys
VCS simulator [8]. The simulation experiments are carried out
in two modes. In the first mode, the CPU core and off-chip buses
and memory are all operating at 100 MHz, a setup that is very similar
to a hardware setup we have in the Hewlett-Packard "Skiff"
Personal Server board [17] with a StrongARM SA-110 processor
and 16 megabytes of off-chip memory [18]. (Note that we model a
smaller off-chip memory of size 0.5 megabytes in accordance with
the smaller applications we consider.) We obtained the simulation
model for the off-chip memory from IDT Technologies, Inc. [15].
In the second mode, the processor core includes a Verilog description
of a store buffer integrated into the core and interfaced to
off-chip buses and off-chip L2 memory operating at 50 MHz. In both
cases, using the set of benchmarks in Table 1, each benchmark is
simulated and switching activity is collected for the processor core,
off-chip buses and off-chip memory models. The switching activity
is fed to the power models of the core, off-chip buses and off-chip
memory along with the technology parameters of a TSMC 0.25µ
CMOS technology standard cell library from Leda Systems [9] to
obtain power and energy estimates.

Figure 3: Memory Power Dissipation Pattern (average power of SRAM model)

4.1  On-chip  Datapath  Power  Estimation 

We use a synthesis based methodology for developing the power
models for the submodules belonging to the datapath (we consider
the datapath to consist of the fetch unit, decode unit, register file,
arithmetic logic unit, data cache access unit and writeback unit).
The synthesis infrastructure consists of two software tools from
Synopsys, Inc.: the Design Compiler and Power Compiler [8]. Design
Compiler generates the gate level netlist from the hardware description
of the submodules, and Power Compiler generates power
estimates for each of the synthesized netlists. The Verilog RTL description
is given as input to the Synopsys Design Compiler. The
output netlist is generated using a TSMC 0.25µ CMOS technology
standard cell library from LEDA Systems [9]. The technology details
include features such as transistor width, transistor length, gate
capacitance, drain capacitance, transistor rise time and transistor
fall time. The TSMC 0.25µ standard cell library is characterized
for leakage power, thus enabling us to include both dynamic and
static power and energy in our analysis.

The synthesis process was guided by fixing the maximum delay
and maximum area. The maximum delay was set to 10 ns and the
maximum area was fixed to infinity so as to get the fastest implementation.
In our case, the modules were synthesized to operate
at greater than or equal to 100 MHz (i.e., at less than or equal to a
10 ns cycle time).

We use Power Compiler from Synopsys to estimate the power of
on-chip components. The Power Compiler obtains the switching
activity of the various functional modules based on the simulation
of benchmarks on MARS. Then, Power Compiler annotates this
switching activity onto the synthesis environment and obtains estimates
of the dynamic and static power dissipation for the particular
technology chosen (in our case, TSMC 0.25µ CMOS technology).


Figure  4:  System  Level  Block  Diagram 


Benchmark   Size (bytes)   Explanation
bubble      8131           Bubble Sort Program
factorial   6641           Factorial Program
fib         6815           Fibonacci Sequence Calculation
matmul      8058           Matrix Multiplication
sort_int    7241           Integer Array Sort

Table 1: Benchmarks


4.2  Off-chip  Bus  Power  Analysis 

We use Spectre simulation to obtain the power for off-chip buses.
The driver component is modeled by a series of inverters (buffers)
of increasing size and the model is designed using TSMC 0.25µ
CMOS process technology parameters. The parameters of TSMC
0.25µ process technology are available through MOSIS [11]. The
bus line capacitance values are obtained from actual measurement
on a PCB board using the Intel StrongARM processor [18]. The
details of the measurement procedure have been explained in [20].

The graph in Figure 2 describes the dependence of the power dissipation
on the switching frequency of the bus and also the power
supply voltage Vdd. The power dissipation is found to decrease
as the switching frequency and the supply voltage are decreased.
(Switching frequency is how often a signal actually switches values.
Clearly, switching frequency of a signal is data- or profile-dependent,
i.e., dependent on a particular benchmark and data.) As
the supply voltage decreases, the average power dissipation reduces
quadratically. As the switching frequency decreases, the average
power dissipation decreases almost linearly. Note that the simultaneous
halving of both voltage and frequency results in cubic savings.
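
The quadratic and linear dependences described above can be illustrated with the standard first-order switching-power relation P ~ C * Vdd^2 * f. The Python sketch below is only an illustration under that assumption; it is not the Spectre-based bus model used in this paper, and the function name is hypothetical. Measured savings will differ from these ideal ratios because the actual switching frequency depends on the benchmark and its data.

```python
def dynamic_power_ratio(v_new, v_old, f_new, f_old):
    """Ratio of new to old switching power, assuming P ~ C * V^2 * f
    with the load capacitance C held constant (first-order model)."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Ideal case from the text: halving both voltage and frequency.
halved = dynamic_power_ratio(0.5, 1.0, 0.5, 1.0)

# Operating points studied in this paper: 3.3 V / 100 MHz vs. 2 V / 50 MHz.
studied = dynamic_power_ratio(2.0, 3.3, 50e6, 100e6)

print(f"halved: {halved:.3f} of original power")
print(f"3.3 V/100 MHz -> 2 V/50 MHz: {studied:.3f} of original power")
```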

4.3  Memory  Power  Analysis 

We use an analytical SRAM model [19] for the off-chip memory
and cache power dissipation. For the off-chip memory power
model, we updated the analytical model [19] using the TSMC 0.25µ
process technology parameters. We use the switching activity from
simulations to obtain estimates for SRAM memory power dissipation.
The variation of power for the memory with supply voltage
Vdd and the switching frequency is thus found.

The graph in Figure 3 describes the dependence of the power dissipation
of the memory on the power supply voltage Vdd and the
switching frequency for a memory size of 0.5 megabytes (note that
switching activity period is the reciprocal of switching frequency).
As the supply voltage is decreased, the average power dissipation is
found to decrease quadratically. Also the memory delay was found
to double as we reduced the voltage from 3.3 Volts to 2 Volts, reducing
the maximum clocking frequency possible from 100 MHz
at 3.3 Volts to 50 MHz at 2 Volts.

For the on-chip cache power dissipation model, we use the same
SRAM model [19]. The model is combined with the capacitance
values obtained from the TSMC 0.25µ CMOS technology parameters.
Note that the main difference between the off-chip memory
model and on-chip cache model is that the off-chip model has significantly
higher capacitance values (due to size) than the on-chip
model.


4.4  System  Level  Power/Energy  Model 

We define the sum total of the processor core power, bus power
and the memory hierarchy power as the system power $P_{sys}$:

$P_{sys} = P_{cpu} + P_{bus} + P_{mem}$    (1)

where
$P_{cpu}$ is the power dissipated by the processor core,
$P_{bus}$ is the power dissipated by the off-chip buses,
$P_{mem}$ is the power dissipated by the memory.

Also, we calculate the system energy $E_{sys}$ for each benchmark
by multiplying the execution time collected by simulation and the
corresponding system power $P_{sys}$:

$E_{sys} = P_{sys} \times t$    (2)

where
$P_{sys}$ is the power dissipated by the system,
$t$ is the total execution time of the benchmark.
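
Equations (1) and (2) can be checked against the power figures reported for the bubble benchmark in the results table of this paper. The Python sketch below is a minimal illustration; the function names are hypothetical, the cpu+cache grouping follows the column layout of the results table, and the execution time used for the energy calculation is a placeholder, not a measured value.

```python
def system_power_w(cpu_cache_w, bus_mw, mem_mw):
    """Equation (1): sum the component powers, converting mW to W."""
    return cpu_cache_w + (bus_mw + mem_mw) / 1000.0


def system_energy_j(p_sys_w, exec_time_s):
    """Equation (2): energy is system power times execution time."""
    return p_sys_w * exec_time_s


# bubble benchmark, off-chip bus and memory at 100 MHz / 3.3 V
base = system_power_w(1.24, 301.64, 1276.49)
# bubble benchmark, off-chip bus and memory at 50 MHz / 2 V
scaled = system_power_w(1.22, 96.14, 541.08)

power_saving_percent = 100.0 * (base - scaled) / base

# Hypothetical execution time, used only to show how Equation (2) is applied.
hypothetical_exec_time_s = 1.0e-3
energy = system_energy_j(scaled, hypothetical_exec_time_s)

print(f"{base:.3f} W -> {scaled:.3f} W, saving {power_saving_percent:.1f}%")
```

The small differences from the table totals come from rounding of the published component values.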


5.  METHODOLOGY 

We now explain the method we use to explore voltage and frequency
scaling as a static technique for design space exploration in
terms of power versus performance trade-offs.

5.1  Voltage/Frequency  Scaling 

In particular, we analyze the case where we reduce the voltage
of the off-chip memory and off-chip buses from 3.3 Volts to 2 Volts
(note that our original system consists of the MARS [12] processor
powered at 2.75 Volts with the memory buses and memory chip
powered at 3.3 Volts). The resulting increase in memory delay (it
doubles) is taken care of by reducing the off-chip bus and off-chip
memory frequency from 100 MHz to 50 MHz. Of course, reducing
off-chip bus and off-chip memory frequency increases program
execution time in proportion to cache misses. We help offset this
effect by adding a store buffer to the processor model. The store
buffer helps to reduce cache miss penalties generated due to "store"
instructions (since the cache model follows a read-write allocation
policy, cache miss servicing takes up a significant amount of time).
On a cache miss on store, the processor stores the data into the store
buffer. The store buffer assumes the responsibility of flushing the
data into the main memory.

We achieve frequency scaling by simulating the off-chip components
at the lower frequency of 50 MHz. We achieve voltage
scaling by using the resulting switching activity with the reduced
voltage value of 2 Volts to calculate the off-chip bus and memory
power dissipation values.

5.2  Store  Buffer  Technique 

Figure 4 shows an architectural description of the embedded system
with a store buffer included. Note that the store buffer is a
16 entry buffer. Stores to external memory are first placed in the
store buffer and subsequently taken out when the off-chip bus is
available. Thus, we reduce cache miss latency due to "store" instructions.
The store buffer maintains synchronization with the
on-chip data cache as well as with the off-chip memory. We use the
Synopsys Power Compiler to estimate the additional power consumed
due to the addition of the store buffer into the processor core.
Since we have integrated the description of the store buffer into
MARS, the new system level power dissipation includes the store
buffer overhead.

Figure 5: Power/Energy Saving Results
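
The behavior of the store buffer described in this section can be sketched in software. The Python model below is only a behavioral illustration and not the Verilog description integrated into MARS. The 16-entry capacity comes from the text; the FIFO drain order, the `lookup` method used to keep loads consistent with pending stores, and all names are assumptions.

```python
from collections import deque


class StoreBuffer:
    """Behavioral sketch of a store buffer (16 entries, as quoted in the text)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = deque()  # oldest store at the left

    def push(self, address, data):
        """Queue a store. Returns False when full, i.e. the processor must stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((address, data))
        return True

    def drain_one(self, memory):
        """Call when the off-chip bus is free: write the oldest store to memory."""
        if not self.entries:
            return False
        address, data = self.entries.popleft()
        memory[address] = data
        return True

    def lookup(self, address):
        """Return the newest pending value for an address, or None (assumed behavior)."""
        for pending_address, data in reversed(self.entries):
            if pending_address == address:
                return data
        return None


memory = {}
buf = StoreBuffer()
accepted = [buf.push(addr, addr * 2) for addr in range(17)]  # 17th store stalls
buf.drain_one(memory)  # bus becomes free once
```

In this sketch the processor stalls only when all 16 entries are occupied, which is how the buffer hides the latency of the slower off-chip memory on store misses.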

5.3  Design  Space  Exploration 

In this section, we discuss the design space exploration of the
power-performance space using voltage and frequency scaling. The
Verilog model (MARS) is simulated in two modes. In the first
mode, the processor, off-chip buses and off-chip memory operate at
100 MHz. In the second mode, the processor operates at 100 MHz
while the off-chip buses and off-chip memory operate at 50 MHz.
In both cases, the Power Compiler collects the switching activity
and uses the collected switching activity to give an estimate of dynamic
and static power dissipation of all the modules within the
core. We also collect the switching activity of off-chip memory elements
and off-chip buses and feed the collected switching activity
information to the analytical models to obtain the power estimates
for off-chip bus and off-chip memory. We obtain the system level
power/energy estimates as explained in Section 4.4.

Our calculations do not include the extra overhead of multiple
supply voltage generation since we assume that the board already
has the multiple supply voltages needed. For example, the "Skiff"
Personal Server Board from HP/Compaq [18] has 2 Volt, 3.3 Volt
and 5 Volt power supplies. Also, currently there are memory chips
from NEC Semiconductors where the chip component operates at
3.3 Volts and the I/O buffer component can operate at either 3.3 Volts
or 2.5 Volts [16]. For example, the memory chip µPD4442361 Synchronous
SRAM can operate at 3.3 Volt chip core voltage and either
3.3 Volt or 2.5 Volt I/O buffer voltage. The µPD4442361 is available
in three speed grades of 133 MHz, 117 MHz and 100 MHz
with the corresponding access times of 6.5 ns, 7.5 ns and 8.5 ns [16].
Therefore, we think it is reasonable to assume that our system will
have at least three supply voltages readily available and that the
system memory chips are capable of operating at dual voltages.
The choice of frequency of off-chip buses and memory is currently
assumed to be made statically by a mechanical or electrical switch.
Benchmark  Executable  Dynamic      Input data size   Data cache  Data cache  Data cache
           size (kB)   instruction                    accesses    misses      miss %
                       count
bubble     34.852      7503         50-integer array  1675        107          6.39
factorial  34.634      6033         1 integer         2006        250         12.46
fib        34.651      30602        1 integer         11840       262          2.21
matmul     34.857      21642        0.5 kB            7358        4916        66.81
sort_int   34.763      23171        0.5 kB            7808        104          1.33

Table 2: Execution Statistics for Various Benchmarks


           Off-chip Bus, Memory at 100 MHz, 3.3 V   Off-chip Bus, Memory at 50 MHz, 2 V
Benchmark  cpu+cache  bus     L2 memory  Total      cpu+cache  bus     L2 memory  Total    %
           (W)        (mW)    (mW)       (W)        (W)        (mW)    (mW)       (W)      Improvement
bubble     1.24       301.64  1276.49    2.817      1.22       96.14   541.08     1.857    34.07
factorial  1.18       444.35  1236.96    2.861      1.15       93.16   797.08     2.040    28.69
fib        1.25       287.68  1228.23    2.766      1.25       92.50   516.06     1.859    32.79
matmul     1.07       637.48  1713.34    3.421      1.04       129.04  1143.51    2.313    32.39
sort_int   1.27       336.78  1485.92    3.093      1.27       111.91  604.11     1.986    35.79

Table 3: System Level Power Estimates


6.  RESULTS 

The benchmarks in Table 1 are chosen to be computationally
intensive. Also, the size of the data has been suitably modified
so as to generate a significant number of L1 cache misses, as can
be seen from Table 2. For example, with matmul we increased
the array sizes so that the 4KB L1 data cache could not hold the
working set. Some of these benchmarks constitute the kernel of
many signal-processing algorithms.

Table 2 shows the dynamic instruction count, cache access and
miss statistics for the given benchmarks. Note that the final miss
rates are smaller than the average miss rates at the beginning/middle
of the execution of the program due to the temporal and spatial
locality exploited by the cache memories. Also note that the matmul
benchmark has a very high miss rate. As a direct consequence,
this benchmark experiences high off-chip traffic. As we will see
later, benchmarks such as this, which need high off-chip bandwidth,
show correspondingly lower improvement in terms of energy from our
technique (although they still benefit in terms of power), as the
power savings are partly offset by the large increase in program
execution time.
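The miss percentages in Table 2 follow directly from the access and miss counts. The short Python sketch below recomputes them; the dictionary layout and the function name `miss_rate_percent` are illustrative choices and are not part of the original tool flow.

```python
# Data cache (accesses, misses) copied from Table 2.
TABLE2 = {
    "bubble": (1675, 107),
    "factorial": (2006, 250),
    "fib": (11840, 262),
    "matmul": (7358, 4916),
    "sort_int": (7808, 104),
}


def miss_rate_percent(accesses, misses):
    """Return the data cache miss rate as a percentage of all accesses."""
    return 100.0 * misses / accesses


if __name__ == "__main__":
    for name, (accesses, misses) in TABLE2.items():
        print(f"{name:10s} {miss_rate_percent(accesses, misses):6.2f} %")
```

Rounded to two decimals, the recomputed values agree with the "Data cache miss %" column, with matmul standing out at roughly two misses for every three accesses.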

Table 3 presents the results related to system level power. The
first three data columns give the three components of power, namely
core power dissipation, off-chip bus power dissipation and off-chip
memory power dissipation, for the case where the MARS processor
core operates at 2.75 Volts and 100 MHz and the off-chip system
operates at 3.3 Volts and 100 MHz. The second group of columns
gives the same three components of power for the case where the
off-chip bus and memory operate at 2 Volts and 50 MHz while
the MARS processor core still runs at 2.75 Volts and 100 MHz.
The system level power estimate is obtained by adding up the core
power, bus power and memory power, as shown in the fifth and
ninth "Total (W)" columns in Table 3. Note that the proposed
technique has reduced the power dissipation by an average of 32% on
the five benchmarks. Also note that the power reduction is fairly
uniform across all benchmarks (between 28.69% and 35.79%)
irrespective of their execution characteristics such as execution
time and dynamic instruction mix. This highlights the dominant
effect of voltage and frequency on the power dissipation.
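As a cross-check of how the "Total (W)" and "% Improvement" columns are formed, the following Python sketch recomputes them from the component columns of Table 3. The data structure and function names are illustrative assumptions; only the numbers come from the table.

```python
# Rows of Table 3: (core W, bus mW, memory mW) at 100 MHz/3.3 V and at 50 MHz/2 V.
TABLE3 = {
    "bubble":    ((1.24, 301.64, 1276.49), (1.22, 96.14, 541.08)),
    "factorial": ((1.18, 444.35, 1236.96), (1.15, 93.16, 797.08)),
    "fib":       ((1.25, 287.68, 1228.23), (1.25, 92.50, 516.06)),
    "matmul":    ((1.07, 637.48, 1713.34), (1.04, 129.04, 1143.51)),
    "sort_int":  ((1.27, 336.78, 1485.92), (1.27, 111.91, 604.11)),
}


def total_power_w(core_w, bus_mw, mem_mw):
    """System power in watts: core plus off-chip bus and memory (given in mW)."""
    return core_w + (bus_mw + mem_mw) / 1000.0


def percent_reduction(before, after):
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


def improvements():
    """Per-benchmark power reduction (%) recomputed from the component columns."""
    return {
        name: percent_reduction(total_power_w(*base), total_power_w(*scaled))
        for name, (base, scaled) in TABLE3.items()
    }


if __name__ == "__main__":
    result = improvements()
    for name, pct in result.items():
        print(f"{name:10s} {pct:5.2f} %")
    print(f"average    {sum(result.values()) / len(result):5.2f} %")
```

Because the totals here are recomputed from unrounded component sums, the percentages can differ from the table in the second decimal place.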

Table 4 presents the statistics for the architectural and circuit
level design space exploration, where execution time represents the
performance axis and system power represents the power axis. Note
that, as expected, the matmul benchmark has a higher penalty in terms
of execution time due to high off-chip traffic for loading and
storing the data arrays.

The energy column combines both design space axes, namely the
performance axis and the power axis, and serves as a baseline for
analyzing power-performance trade-offs. For example, the factorial
benchmark shows a performance penalty of 11.37% in return for a
power reduction of 28.69% and an energy reduction of 20.48%. All
benchmarks show improvements in both power and energy. However,
as expected, the execution time increases. This shows that our
technique reduces both power and energy by reducing the share of
power consumed by the off-chip bus and memory, at a (possibly
small) penalty in increased execution time.
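The energy figures in Table 4 are the product of execution time and system power. The Python sketch below reproduces the factorial trade-off under that reading; the function names are illustrative, and the inputs are the rounded values printed in the table.

```python
def energy_mj(time_us, power_w):
    """Energy in millijoules from execution time (microseconds) and power (watts)."""
    # microseconds * watts = microjoules; divide by 1000 to get millijoules
    return time_us * power_w / 1000.0


def percent_change(before, after):
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (after - before) / before


def factorial_tradeoff():
    """Return (time increase %, energy decrease %) for the factorial row of Table 4."""
    t_base, p_base = 116.115, 2.861      # 100 MHz, 3.3 V off-chip
    t_scaled, p_scaled = 129.325, 2.040  # 50 MHz, 2 V off-chip
    time_increase = percent_change(t_base, t_scaled)
    energy_decrease = -percent_change(energy_mj(t_base, p_base),
                                      energy_mj(t_scaled, p_scaled))
    return time_increase, energy_decrease


if __name__ == "__main__":
    dt, de = factorial_tradeoff()
    print(f"execution time +{dt:.2f} %, energy -{de:.2f} %")
```

The recomputed energy decrease differs slightly from the tabulated 20.48% because the table appears to derive its percentage from energies already rounded to three decimals.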

Figure  5  presents  the  overall  results  of  the  system  level  power 
and  system  level  energy.  Our  technique  of  static  voltage/frequency 
scaling  combined  with  the  store  buffer  is  seen  to  reduce  both  power 
and  energy  at  the  system  level. 


7.  CONCLUSION  AND  FUTURE  WORK 

In conclusion, we have shown how the simple technique of cutting
the off-chip voltage from 3.3 Volts to 2 Volts (note that this also
cuts the voltage for the processor I/O pad driver logic), together
with the reduction in frequency of off-chip buses and memory
(enabled by the extra latency available), reduces both power and
energy. The basic point is that both the compiler and the programmer
can take advantage of smart architectural and memory hierarchy
features, which allow the reduction of power with some corresponding
trade-offs in terms of performance. For example, the choice of
100 MHz versus 50 MHz for the off-chip bus frequency could be made
dynamically programmable (e.g., by writing to a special on-chip
register or a memory-mapped location). In this case, code could be
written which operates at high performance and high power during
critical times, but scales down to lower performance and low power
during non-critical time periods. This paper lays the groundwork
for such a system, where off-chip buses and memory and possibly
more peripherals have their power dissipation modulated either by
the user or by the compiler.

We will look at further hiding the load-instruction memory access
latency caused by slowing down the off-chip buses and off-chip
memory. We will explore other configurations for off-chip
memory and also other memory technologies (e.g., SDRAMs). We
will pursue making the voltage and frequency scaling dynamic.
           Off-chip Bus, Memory         Off-chip Bus, Memory         Percent Change
           at 100 MHz, 3.3 V            at 50 MHz, 2 V
Benchmark  Execn Time  Power  Energy    Execn Time  Power  Energy    Execn Time    Energy
           (µs)        (W)    (mJ)      (µs)        (W)    (mJ)      increase (%)  decrease (%)
bubble     113.945     2.817  0.321     122.265     1.857  0.227      7.3          29.3
factorial  116.115     2.861  0.332     129.325     2.040  0.264     11.37         20.48
fib        456.795     2.766  1.263     463.245     1.859  0.861      1.4          31.83
matmul     924.735     3.421  3.164     1192.98     2.313  2.759     29.0          12.8
sort_int   296.425     3.093  0.917     300.265     1.986  0.596      1.29         35.0

Table 4: System Level Design Space Exploration
Abstract

In this paper we show that the energy reductions obtained from
using two techniques, data remapping and voltage/frequency
scaling of off-chip bus and memory, combine to provide interesting
trade-offs between energy, execution time and power.
Both methods aim to reduce the energy consumed by the memory
subsystem. Data remapping is a fully automatic compile
time technique applicable to pointer-intensive dynamic
applications. Voltage/frequency scaling of off-chip memory is a
technique applied at the hardware level. When combined,
energy reductions can be as high as 46%. The improvements
are verified in the context of two OLDEN pointer-centric
benchmarks, namely Perimeter and Health.

1  Introduction 

In embedded systems, memory is a significant power/energy
sink, often consuming as much as half of the total
power/energy [2]. In this paper we focus on simultaneously
applying a hardware technique and a compile time technique
in order to obtain significant energy savings. Our target
processor is an ARM-like processor. The two methods we apply
are data remapping (a compile time technique) and
voltage/frequency scaling of the off-chip bus and memory.

An embedded system usually consists at least of a processor
(including L1 cache), off-chip memory and off-chip bus. The
off-chip components are highly capacitive. This causes them
to consume close to half of the digital system power (where
digital system power is the power consumed by the processor
plus memory). Slowing down the off-chip memory by scaling
the voltage and frequency [19, 20, 15] can be used to reduce the
energy consumed by the off-chip memory. But slowing down
the off-chip memory will also reduce performance.

Data remapping [11, 10] is a compile time technique. It is
used to remap the application's data layout so that data
elements that are accessed contemporaneously are placed
together in memory. Remapping improves spatial locality and
thus reduces cache misses. Cache misses are expensive in
terms of performance. Remapping leads to a reduction in the
execution time and energy.

The main drawback of the voltage/frequency scaling technique
is the reduction in performance. Combining the two
techniques can increase the overall reduction in energy.
Furthermore, where the allocated execution time is fixed, the
faster execution due to data remapping can offset the
slower execution due to the reduced clock frequency of the
off-chip memory, resulting in the same original execution time at
dramatically reduced power (and energy). Section 2 describes
related work. Section 3 describes the experimental infrastructure
used for estimating the power and energy of the system for
the before and after cases. Section 4 gives an overview of the
data remapping algorithm. Section 5 describes the methodology
used for voltage and frequency scaling, and Section 6
gives our design space exploration. Section 7 describes the
results obtained after applying the two methods in terms of
energy savings, and Section 8 concludes the paper.


2  Related  Work 

A framework similar to our infrastructure is SimpleScalar
ARM, a framework for power and performance analysis.
Another is SimplePower [14], which implements a subset of the
instructions supported by SimpleScalar.

Unlike SimpleScalar, our model (MARS, introduced later)
is at the Register Transfer Level (RTL) and thus is more
accurate.

With respect to hardware techniques, there is a lot of ongoing
work on gating the supply voltage to cache memories [6, 12]
and on dynamically adjusting the frequency or shutting down
unused modules [4].

Related work in data reorganization [7] proposes automated
field re-ordering that assigns temporally related fields to
adjacent memory locations. However, it offers only a partial
solution, as it does not consider fields in different instances
of a record.
Figure 1: Experimental Infrastructure


3  Experimental  Setup 

In this section we describe the experimental setup used to
simulate and evaluate the combined techniques of data remapping
and voltage/frequency scaling.

The core of the simulation environment is MARS (Michigan
Arm Simulator, obtained from the University of Michigan) [16],
capable of running ARM instructions. The power of the core
processor can be estimated using Synopsys Power Compiler.
The switching activity of the various nets is collected via
simulation. The MARS model was synthesized using the TSMC
0.25u library [18]. Using Synopsys Power Compiler [17], power
models of the synthesized MARS model were created. The
processor power was estimated using the power models and
the switching activity of the nets at 2.75 volts.

The remapping benchmarks were implemented using
TRIMARAN, a compiler framework which includes the
TRICEPS [8] ARM code generator and the smart memory and
cache hierarchy simulator (SMACHS) [21]. The execution
statistics from TRIMARAN are used to estimate the power of
the memory subsystem.

The model used has an L1 on-chip cache and an L2 off-chip
cache. To estimate the energy of the primary and secondary
cache we assume an SRAM model. The Kamble and Ghose
approach is used for energy estimation [9]. This approach
requires execution statistics such as cache requests, read hits
and misses, write hits and misses, and execution cycles for both
the L1 and L2 caches. These execution statistics were obtained
from simulations using SMACHS. It also requires information
about the cache configuration, such as cache size, block size and
tag size. One of the main disadvantages of the Kamble and
Ghose method is that it does not model the I/O pads. Another
disadvantage is that the model only accounts for dynamic
power dissipation. This approximation (not including
static/leakage power) is valid with respect to 0.25u technology
but may not be valid for smaller (e.g., 0.09u) technologies.

The off-chip bus power is estimated using Spectre simulation.
The driver component is modeled by a series of inverters.
The  model  is  designed  using  0.25u  TSMC  library.  The  bus  line 
capacitance  values  are  obtained  from  actual  measurements  of 
a  PCB  with  an  Intel  StrongARMl  110  processor. 

The total system energy is estimated using the following
approach. The energy for the L1 and L2 caches is obtained
directly using the Kamble and Ghose model [15, 9]. The
processor power is obtained from the Synopsys Power Compiler
and multiplied by the execution time to obtain the processor
energy. The bus power obtained from the Spectre simulations is
likewise multiplied by the execution time to obtain the energy
of the off-chip bus. The energies of the bus, processor, L1 cache
and L2 cache are summed to get the total system energy. The
resulting measurements for our examples are shown in Section 7.
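The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the summation, assuming a 100 MHz processor clock for converting cycles to seconds; the function names are hypothetical and do not correspond to any tool in the flow. The component energies in the example are those reported for the Health benchmark in Table 3 of Section 7.

```python
CLOCK_HZ = 100e6  # processor clock of 100 MHz, as stated in the text


def exec_time_s(cycles, clock_hz=CLOCK_HZ):
    """Execution time in seconds for a given cycle count."""
    return cycles / clock_hz


def energy_from_power_j(power_w, time_s):
    """Energy of a component whose power estimate is multiplied by run time."""
    return power_w * time_s


def total_system_energy_j(processor_j, bus_j, l1_j, l2_j):
    """Total energy: processor and bus energies plus the two cache energies."""
    return processor_j + bus_j + l1_j + l2_j


if __name__ == "__main__":
    # Component energies (J) for Health after remapping and scaling.
    total = total_system_energy_j(6.415, 0.433, 0.3155, 2.104)
    print(f"time  = {exec_time_s(578046486):.3f} s")
    print(f"total = {total:.4f} J")
```

The components sum to about 9.27 J, close to the 9.274 J total reported for that configuration.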

4  Data  Remapping 

Data  Remapping  is  a  compile  time  technique.  It  is  an  efficient 
remapping  of  an  application’s  data  layout  in  memory  such  that 
data  elements  that  are  accessed  contemporaneously  are  placed 
together  in  memory.  If  a  reference  stream  does  not  exhibit 
address  adjacency,  valuable  resources  are  wasted  as  data  is 
unnecessarily  fetched  and  cached.  The  remapping  technique 
remaps  elements  into  new  sets  such  that  data  items  that  are 
more  likely  to  be  used  together  are  grouped  together  into  the 
same  cache  block. 

Data remapping can be usefully applied to applications that
make heavy use of record data types and pointers. Consider an
example where, in a file of records, a particular field of all
records has to be searched or modified. The original mapping of
the data in memory places the fields belonging to a particular
record together. When a cache line is fetched, all the data in it
other than that particular field is wasted. Also, the search for
the next field will lead to a cache miss. If instead the instances
of that field from all records were placed together, cache misses
would be reduced and energy would not be wasted in fetching data
that is not useful. The remapping algorithm is a combination of
field reordering and customized placement to exhibit better
spatial locality.
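To make the record example concrete, the Python sketch below counts how many cache lines a scan of one field touches under the two layouts. It is a simplified illustration, not the remapping algorithm itself: the record count, the number and size of fields, and the assumption that data starts at a line-aligned address 0 are all hypothetical, and only the 16-byte line size is taken from the L1 configuration reported later.

```python
def lines_touched(addresses, line_size):
    """Number of distinct cache lines covered by a list of byte addresses."""
    return len({addr // line_size for addr in addresses})


def record_major_addresses(num_records, fields_per_record, field_size, field_index):
    """Addresses of one field when each record's fields are stored together."""
    record_size = fields_per_record * field_size
    return [r * record_size + field_index * field_size for r in range(num_records)]


def field_major_addresses(num_records, field_size, field_index):
    """Addresses of one field when that field of all records is stored contiguously."""
    base = field_index * num_records * field_size
    return [base + r * field_size for r in range(num_records)]


if __name__ == "__main__":
    # Hypothetical workload: 1024 records of four 4-byte fields, 16-byte lines.
    n, fields, size, line = 1024, 4, 4, 16
    before = lines_touched(record_major_addresses(n, fields, size, 1), line)
    after = lines_touched(field_major_addresses(n, size, 1), line)
    print(f"record-major layout: {before} lines, field-major layout: {after} lines")
```

With these assumed parameters the record-major scan touches 1024 lines while the field-major scan touches 256, since four useful field values now share each fetched line.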

The remapping optimization consists of three phases: a
gathering phase, remapping of global data objects and remapping
of dynamic data objects. In the gathering phase, an analysis
of application memory access patterns along program
hotspots [1] is performed. The remapping strategy cannot be
arbitrarily applied to all data objects in the program. It is
applied based on the analysis obtained from the gathering phase.
In the second phase global data objects are remapped. Once the
candidate records have been identified, global program variables
are filtered to isolate the arrays of records which are remapped.
The third phase remaps dynamic data objects (i.e., pointer
variables). The third phase is crucial as applications increasingly
rely on dynamically allocated objects [3, 5].

5  Voltage  and  Frequency  Scaling  of 
Off-Chip  Memory  and  Buses 

The power consumed is proportional to the square of the
voltage. Thus, reducing the voltage will lead to a quadratic
reduction in power. But lowering the voltage of a component
increases its delay, which affects performance. The
off-chip memory and buses are highly capacitive and thus
consume close to half of the system power. To reduce the overall
system power significantly we scale the voltage of the off-chip
memory and buses. In our system the off-chip memory is an
L2 cache.

Figure 2 shows the slowing down of L2 memory. The original
system runs at 100MHz with the processor at 2.75 volts
and off-chip components (including bus and memory) at 3.3
volts. The voltage of the off-chip bus and memory was scaled
from 3.3 volts to 2 volts. This causes the delay of the off-chip
memory and bus to double. To take into consideration the
increase in delay, the frequency of the off-chip components was
scaled from 100MHz to 50MHz.
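For a rough sense of scale, the standard first-order model of CMOS dynamic power, P proportional to C·V²·f, can be applied to the operating points above. The Python sketch below is an idealized estimate under that assumption only; it ignores changes in switching activity, static power and any component that does not scale, so it is not a prediction of the measured results. The function name is illustrative.

```python
def dynamic_power_scale(v_old, v_new, f_old_mhz, f_new_mhz):
    """Ratio of new to old dynamic power under P ~ C * V^2 * f (C unchanged)."""
    return (v_new / v_old) ** 2 * (f_new_mhz / f_old_mhz)


if __name__ == "__main__":
    voltage_only = dynamic_power_scale(3.3, 2.0, 100, 100)
    voltage_and_freq = dynamic_power_scale(3.3, 2.0, 100, 50)
    print(f"voltage scaling alone:       {voltage_only:.3f} of original power")
    print(f"voltage + frequency scaling: {voltage_and_freq:.3f} of original power")
```

Under this idealization, scaling 3.3 volts to 2 volts alone leaves about 37% of the dynamic power, and halving the frequency as well leaves about 18%. Note that halving the frequency lowers power but, for a fixed amount of work, not the dynamic energy, which is why the voltage term is the main source of energy savings.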


Figure 2: Slowing down L2 Memory

Frequency scaling is achieved by simulating the off-chip
components at 50MHz instead of the original 100MHz. The
D-Cache and I-Cache controllers were modified such that they
fetch data from the memory at 50MHz instead of the previous
100MHz. This is done by doubling the latency of the memory
(in our case the L2 cache) from 7 to 14 cycles. The voltage at
which power is estimated is reduced from 3.3 volts to 2 volts to
simulate voltage scaling in the case of the off-chip bus and
memory.
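The effect of doubling the L2 latency on run time can be bounded with a very simple stall model. The Python sketch below assumes that every L2 access stalls the processor for the full latency with no overlap, which is a pessimistic simplification; the access count and base cycle count in the example are hypothetical and are not taken from the experiments.

```python
def added_stall_cycles(l2_accesses, old_latency=7, new_latency=14):
    """Extra cycles when each L2 access takes new_latency instead of old_latency."""
    return l2_accesses * (new_latency - old_latency)


def slowdown_percent(base_cycles, l2_accesses, old_latency=7, new_latency=14):
    """Execution-time increase (%) relative to the original cycle count."""
    extra = added_stall_cycles(l2_accesses, old_latency, new_latency)
    return 100.0 * extra / base_cycles


if __name__ == "__main__":
    # Hypothetical run: 100 million cycles with 1 million L2 accesses.
    print(f"{slowdown_percent(100_000_000, 1_000_000):.1f} % slower")
```

The model makes the dependence explicit: the penalty grows linearly with the number of L2 accesses, so a program with fewer L1 misses loses less to the slower off-chip memory.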

6  Design  Space  Exploration 

The original system consists of the processor and off-chip
components running at 100MHz. We simulate the system using
two benchmarks, health and perimeter, before remapping.
We call the original system the before case. The after case is
where the processor is simulated at 100MHz and the off-chip
components are simulated at 50MHz. The health and perimeter
benchmarks are remapped and simulated with 50MHz L2
memory so that the effect of combining both techniques can
be determined. Switching activity files are collected from the
simulations using the MARS model and are used to determine
the processor and bus power in both cases. The execution
statistics from the Trimaran ARM simulator are used to
determine the power for the L1 cache and off-chip L2 cache.

The data remapping allows the program to achieve the same
overall execution time with half the cache resources. Since the
cache is expensive in terms of both power and cost, halving
the cache size would lead to roughly half the cost and power
requirements. Results have been obtained by using half the L1
and L2 cache size.

The power calculations do not include the overhead of multiple
supply voltages, as we assume that multiple supply voltages
are already present on the board. It is also assumed that
voltage scaling (i.e., changing the voltage of off-chip
components from 3.3 volts to 2 volts) is done statically.

7  Results 

The energy savings from combining the two techniques are
shown for two Olden benchmarks [13], namely perimeter
and health. The benchmarks were selected because they are
suitable for remapping. Perimeter allocates quad trees at
program startup and does not modify them. The primary
data structure used in health is a linked list to which elements
are added and removed.

Table 1 shows the results for the health benchmark. The
L1 is a 32KB cache with a 16-byte line size. The L2 is a 1MB
cache with a 32-byte line size. We find that for the health
benchmark there is a large reduction in the execution cycles, but
for perimeter the reduction in execution cycles is not as large
(see Table 2). Data remapping improves performance, but much
of this performance gain is lost due to slowing down
the off-chip memory. This is clearly seen in the case of the
Health benchmark. We also find that for both benchmarks
there is a large decrease in the energy of the L2 cache. Even
though the processor power is almost constant, the processor
energy decreases because of the performance gains due to
remapping. We achieve a maximum energy reduction of 46%
in the Health benchmark. From our experiments, we observed
that there is no simple linear relationship among data
remapping, frequency/voltage scaling of second level memory,
energy reduction and energy delay reduction.

The remapping technique allows a program to run with the
same execution time but with far less memory. To explore
the design space, we also considered reducing the L1 and L2
Table 1: Energy Delay with Frequency/Voltage Scaling of Memory (FVM) and Data Remapping (DR) for Health Benchmark

                            Before     After      After      After      After DR+FVM  After DR+FVM  After DR+FVM
                            DR, FVM    DR         FVM        DR+FVM     1/2 size L1   1/2 size L2   1/2 size L1,L2
Execution Cycles            803645821  479612138  892552982  578046486  603275469     711151104     736311686
Delay (Execution Time) (s)  8.036      4.796      8.926      5.780      6.033         7.112         7.363
Energy (J)                  17.076     9.274      14.316     9.274      9.468         11.158        10.134
Energy*Delay                137.231    44.479     127.778    53.608     57.118        79.350        74.618
% Energy Reduction          0.00       39.33      16.16      45.69      44.55         34.66         40.65
% Energy*Delay Reduction    0.00       67.59      6.89       60.94      58.38         42.18         45.63

Table 2: Energy Delay with Frequency/Voltage Scaling of Memory (FVM) and Data Remapping (DR) for Perimeter Benchmark

                            Before      After      After       After      After DR+FVM  After DR+FVM  After DR+FVM
                            DR, FVM     DR         FVM         DR+FVM     1/2 size L1   1/2 size L2   1/2 size L1,L2
Execution Cycles            1065497740  967549770  1073983968  999065267  999305168     999095525     999339410
Delay (Execution Time) (s)  10.655      9.675      10.740      9.991      9.993         9.991         9.993
Energy (J)                  23.361      21.648     17.860      16.897     16.414        14.221        13.828
Energy*Delay                248.911     209.455    191.814     168.812    164.026       142.081       138.189
% Energy Reduction          0.00        7.33       23.55       27.67      29.74         39.13         40.81
% Energy*Delay Reduction    0.00        15.85      22.94       32.18      34.10         42.92         44.48

Table 3: Energy results after remapping and Voltage Scaling (L1=32KB, L2=1MB) for Health Benchmark

Execution  Processor   Off-chip Bus  L1 Cache    L2 Cache    Total       % reduction in
Cycles     Energy (J)  Energy (J)    Energy (J)  Energy (J)  Energy (J)  Total Energy
578046486  6.415       0.433         0.3155      2.104       9.274       45.69
cache sizes to half their original sizes. The last three columns
in Tables 1 and 2 show the energy results after halving the L1
cache, the L2 cache and both the L1 and L2 caches, respectively.
We find that, as expected, the energy requirements of the cache
are also reduced by half. In the case of the Perimeter benchmark
the execution time remains the same and thus the energy saving in
the memory subsystem is reflected in the overall energy gains.
A maximum of 40.81% energy reduction is achieved in the case
of the Perimeter benchmark when both caches are reduced to half
their size. But in the case of the Health benchmark, the reduction
in cache size leads to an increase in the execution time. Even
though the energy requirement of the memory subsystem is reduced,
this is not reflected in the overall energy gains due to the
increase in execution cycles. Thus, for the Health benchmark,
the maximum energy reduction of 45.69% is found with both
caches at their original sizes (L1=32KB, L2=1MB).
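The derived rows of Tables 1 and 2 can be reproduced from the cycle counts and energies alone. The Python sketch below does this for two columns of the Health table, assuming the 100 MHz processor clock stated earlier for converting cycles to seconds; the function names are illustrative.

```python
CLOCK_HZ = 100e6  # processor clock used to convert cycles to seconds


def delay_s(cycles, clock_hz=CLOCK_HZ):
    """Execution time in seconds."""
    return cycles / clock_hz


def energy_delay(cycles, energy_j):
    """Energy*Delay product in joule-seconds."""
    return energy_j * delay_s(cycles)


def percent_reduction(before, after):
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Health benchmark, Table 1: "Before" and "After DR+FVM" columns.
    before = (803645821, 17.076)
    combined = (578046486, 9.274)
    ed_before = energy_delay(*before)
    ed_combined = energy_delay(*combined)
    print(f"Energy*Delay: {ed_before:.3f} -> {ed_combined:.3f}")
    print(f"energy reduction:       {percent_reduction(before[1], combined[1]):.2f} %")
    print(f"energy*delay reduction: {percent_reduction(ed_before, ed_combined):.2f} %")
```

Because Energy*Delay weights execution time twice (once through energy and once directly), configurations that slow the program down, such as the halved caches for Health, are penalized more heavily under this metric than under energy alone.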

8  Conclusion 

There are many techniques at both the hardware and compiler
level aimed at saving energy and power. In this work we
have demonstrated a combination of two techniques, one at the
hardware level and one at the compiler level. The main drawback
of hardware techniques is that they trade off power against
performance. In our work, by combining the two techniques,
we are able to obtain energy savings without a performance
loss. For future work, we are looking at additional
architecture level techniques aimed at the memory subsystem
(specifically at the cache) and processor where compiler and
hardware techniques interact to reduce energy.
Power dissipation is becoming one of the
major design issues in high performance processor
architectures.

In this paper, we study the contribution of compiler
optimizations to energy reduction. In particular,
we are interested in the impact of loop optimizations
in terms of performance and power tradeoffs.
Both low-level loop optimizations at the code generation
(back-end) phase, such as loop unrolling and software
pipelining, and high-level loop optimizations at the
program analysis and transformation phase (front-end),
such as loop permutation and tiling, are studied.

In this study, we use the Delaware Power-Aware
Compilation Testbed (Del-PACT), an integrated
framework consisting of a modern industry-strength
compiler infrastructure and a state-of-the-art
micro-architecture-level power analysis platform. Using
Del-PACT, the performance/power tradeoffs of loop
optimizations can be studied quantitatively. We have
studied such impact on several benchmarks under
Del-PACT.

The  main  observations  are: 

•  Performance improvement (in terms of timing)
is correlated positively with energy reduction.

•  The impact on energy consumption of high-level
and low-level loop optimizations is often
closely coupled, and we should not evaluate
individual effects in complete isolation. Instead, it
is very useful to assess the combined contribution
of both high-level and low-level loop
optimizations.

•  In particular, results of our experiments are
summarized as follows:

-  Loop unrolling reduces execution time
through effective exploitation of ILP from
different iterations and results in energy
reduction.

-  Software pipelining may help in reducing
total energy consumption, due to the
reduction of the total execution time. However,
in the two benchmarks we tested,
the effects of high-level loop transformation
cannot be ignored. In one benchmark,
even with software pipelining disabled,
applying proper high-level loop
transformation can still improve the overall
execution time and energy, compared
with the scenario where high-level loop
transformation is disabled though software
pipelining is applied.

-  Some high-level loop transformations such
as loop permutation, loop tiling and loop
fusion contribute significantly to energy
reduction. This behavior can be attributed
to reducing both the total execution time
and the total main memory activities (due
to improved cache locality).

An analysis and discussion of our results is presented
in Section 4.

1  Introduction 

Low power design and optimization [8] are becoming
increasingly important in the design and application
of modern microprocessors. Excessive power
consumption has serious adverse effects; for example,
the usefulness of a device or equipment is reduced
due to the short battery lifetime.

In this paper, we focus on compiler optimization
as a key area in low-power design [7, 13]. Many
traditional compiler optimization techniques are aimed
at improving program performance, such as reducing
the total program execution time. Such
performance-oriented optimization may also help to reduce
total energy consumption since a program terminates faster.
But things may not be that simple. For instance,
some of these optimizations may try to improve
performance by exploiting instruction-level parallelism,
thus increasing power consumption per unit time.
Other optimizations may reduce total execution time
without increasing power consumption. The tradeoffs
of these optimizations remain an interesting research
area to be explored.

In this study, we are interested in the impact
of loop optimizations in terms of performance and
power tradeoffs. Both low-level loop optimizations
at the code generation (back-end) phase, such as loop
unrolling and software pipelining, and high-level
loop optimizations at the program analysis and
transformation phase (front-end), such as loop
permutation and tiling, are studied.


Since both high-level and low-level optimizations
are involved in the study, it is critical to use an
experimental framework where such tradeoff studies
can be conducted effectively. We use the Delaware
Power-Aware Compilation Testbed (Del-PACT), an
integrated framework consisting of a modern
industry-strength compiler infrastructure and a
state-of-the-art micro-architecture level power
analysis platform. Using Del-PACT, the
performance/power tradeoffs of loop optimizations can
be studied quantitatively. We have studied this impact
on several benchmarks under Del-PACT.

This paper describes the motivation for studying the
effect of loop optimization on program performance
and power in Section 2 and the Del-PACT platform in
Section 3. The results of applying loop optimizations
to save energy are given in Section 4. Conclusions
are drawn in Section 5.

2  Motivation for Loop Optimization to Save Energy

In this section we use some examples to illustrate the
loop optimizations which are useful for energy saving.
Both low-level loop optimizations at the code
generation (back-end) phase, such as loop unrolling
and software pipelining, and high-level loop
optimizations at the program analysis and transformation
phase (front-end), such as loop permutation, loop
fusion and loop tiling, are discussed.

2.1  Loop  unrolling 

Loop unrolling [17] aims to increase the
instruction-level parallelism of loop bodies by
replicating the loop body multiple times so that
several loop iterations can be scheduled together. The
transformation also reduces the number of times the
loop control statements are executed.
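
As a minimal sketch of the transformation, assuming a simple array-sum loop (the function names sum_rolled and sum_unrolled4 are illustrative and are not taken from the study), unrolling by a factor of 4 could look like this:

```c
#include <stddef.h>

/* Original loop: one add and one loop-control test per element. */
long sum_rolled(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: four independent partial sums expose ILP, and the
 * loop-control test runs once per four elements. */
long sum_unrolled4(const int *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder when n is not a multiple of 4 */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

A compiler performs this rewrite on its intermediate representation rather than on source code; the source-level form is shown only to make the effect visible.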

2.2  Software pipelining


Software pipelining restructures the loop kernel to
increase the amount of parallelism in the loop, with
the intent of minimizing the time to completion.
Resource-constrained software pipelining [10, 16] has
been studied extensively by several researchers, and a
number of modulo scheduling algorithms have been
proposed in the literature. A comprehensive survey of
this work is provided by Rau and Fisher in [15]. The
performance of a software pipelined loop is measured
by its initiation interval (II). A new iteration is
initiated every II cycles, so the throughput of the
loop is often measured by the value of II derived from
the schedule. By reducing program execution time,
software pipelining helps reduce the total energy
consumption. But, as we will show later in the paper,
the net effect of software pipelining on energy
consumption also depends on the high-level loop
transformations performed earlier in the compilation
process.
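
The overlap of iterations can be illustrated at source level. The sketch below is hand-written and only an analogy: a real compiler pipelines at the instruction level using modulo scheduling, and the function names and the two-stage split (load, then multiply and store) are assumptions made for this example.

```c
#include <stddef.h>

/* Straightforward loop: the load, multiply and store of one iteration
 * complete before the next iteration starts. */
void scale_plain(const int *a, int *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * 3;
}

/* Hand-pipelined version with two stages (load | multiply+store).
 * The kernel overlaps the load of iteration i+1 with the multiply
 * and store of iteration i, starting one new iteration per pass. */
void scale_pipelined(const int *a, int *b, size_t n) {
    if (n == 0)
        return;
    int loaded = a[0];                 /* prologue: fill the pipeline  */
    for (size_t i = 0; i + 1 < n; i++) {
        int next = a[i + 1];           /* stage 1 of iteration i+1     */
        b[i] = loaded * 3;             /* stage 2 of iteration i       */
        loaded = next;
    }
    b[n - 1] = loaded * 3;             /* epilogue: drain the pipeline */
}
```

The prologue, kernel and epilogue structure is the part that carries over to compiler-generated software pipelines.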

2.3  Loop  permutation 

Loop  permutation  (also  called  loop  interchange  for 
two  dimensional  loops)  is  a  useful  high-level  loop 
transformation  for  performance  optimization  [19]. 
See  the  following  C  program  fragment: 

for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        a[j][i] = 1;
    }
}


Since the array a is stored in row-major order,
the above program fragment does not have good cache
locality, because two successive references to array a
are far apart in memory. By switching the inner and
outer loops, the original loop is transformed into:

for (j = 0; j < N; j++) {
    for (i = 0; i < M; i++) {
        a[j][i] = 1;
    }
}

Note that two successive references to array a now
access contiguous memory addresses, so the program
exhibits good data cache locality. This usually
improves both the program execution time and the power
consumption of the data cache.
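
The difference between the two loop orders can be made concrete by measuring the address step taken by each inner loop. This is a small sketch under assumed array dimensions (N = 4 rows and M = 8 columns are chosen only for the example, and the helper names are hypothetical):

```c
#include <stddef.h>

#define N 4   /* rows of a; sizes are assumed only for this example */
#define M 8   /* columns of a */

static int a[N][M];

/* Byte distance between a[j][i] and a[j+1][i]: the step taken by the
 * inner loop of the original nest (inner index j). */
size_t stride_inner_j(void) {
    return (size_t)((char *)&a[1][0] - (char *)&a[0][0]);
}

/* Byte distance between a[j][i] and a[j][i+1]: the step taken by the
 * inner loop after permutation (inner index i). */
size_t stride_inner_i(void) {
    return (size_t)((char *)&a[0][1] - (char *)&a[0][0]);
}
```

With row-major storage the first step is one full row (M elements) and the second is a single element, which is why the permuted nest touches each cache line fully before moving on.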

2.4  Loop  tiling 

Loop tiling is a powerful high-level loop optimization
technique useful for memory hierarchy optimization
[14]. See the matrix multiplication program
fragment:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] +
                      a[i][k] * b[k][j];
        }
    }
}

Two successive references to the same element of
a are separated by N multiply-and-sum operations.
Two successive references to the same element of
b are separated by N^2 multiply-and-sum operations.
Two successive references to the same element of c
are separated by 1 multiply-and-sum operation. When
N is large, references to a and b exhibit little
locality, and the frequent data fetching from memory
results in high power consumption. Tiling the loop
transforms it to:

for (i = 0; i < N; i += T) {
  for (j = 0; j < N; j += T) {
    for (k = 0; k < N; k += T) {
      for (ii = i; ii < min(i+T, N); ii++) {
        for (jj = j; jj < min(j+T, N); jj++) {
          for (kk = k; kk < min(k+T, N); kk++) {
            c[ii][jj] = c[ii][jj] +
                        a[ii][kk] * b[kk][jj];
          }
        }
      }
    }
  }
}

Notice that in the inner three loop nests, we only
compute a partial sum of the resulting matrix. When
computing this partial sum, two successive references
to the same element of a are separated by T
multiply-and-sum operations. Two successive references
to the same element of b are separated by T^2
multiply-and-sum operations. Two successive references
to the same element of c are separated by one
multiply-and-sum operation. A cache miss occurs
when the program execution re-enters the inner three
loop nests after i, j or k is incremented. However,
cache locality in the inner three loops is improved.

Loop tiling may have a dual effect on total energy
consumption: it reduces both the total execution time
and the cache miss ratio, and both help energy
reduction.
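
A runnable version of the two fragments makes it easy to confirm that tiling changes only the iteration order and not the result. The sketch below assumes small fixed sizes (N = 6 and T = 4 are arbitrary, with T deliberately not a divisor of N) and hypothetical function names; the caller is expected to zero c first.

```c
#include <stddef.h>

#define N 6   /* matrix dimension; an assumed small value */
#define T 4   /* tile size; assumed, and deliberately not a divisor of N */

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Untiled reference: c += a * b. */
void matmul_naive(double a[N][N], double b[N][N], double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Tiled version: the three outer loops walk over T x T tiles and the
 * three inner loops stay inside one tile; min_sz() clips the last,
 * partial tile. */
void matmul_tiled(double a[N][N], double b[N][N], double c[N][N]) {
    for (size_t i = 0; i < N; i += T)
        for (size_t j = 0; j < N; j += T)
            for (size_t k = 0; k < N; k += T)
                for (size_t ii = i; ii < min_sz(i + T, N); ii++)
                    for (size_t jj = j; jj < min_sz(j + T, N); jj++)
                        for (size_t kk = k; kk < min_sz(k + T, N); kk++)
                            c[ii][jj] += a[ii][kk] * b[kk][jj];
}
```

In practice T would be chosen so that the tiles of a, b and c in use fit in the data cache together; the value here is only for demonstration.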

2.5  Loop  fusion 

See  the  following  program  fragment: 

for (i = 0; i < N; i++) {
    a[i] = 1;
}

for (i = 0; i < N; i++) {
    a[i] = a[i] + 1;
}

In the code above, two successive references to the
same element of a are separated by a traversal of the
whole array a. By fusing the two loops together, we
get the following code fragment:

for (i = 0; i < N; i++) {
    a[i] = 1;
    a[i] = a[i] + 1;
}

The transformed code has much better cache locality.
Just like loop tiling, this transformation will
reduce both power and energy consumption.
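
The two fragments can be wrapped into functions to check that fusion preserves the result. This is a minimal sketch; the function names are illustrative only.

```c
#include <stddef.h>

/* Unfused: each element of a is touched in two separate passes. */
void init_then_increment(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = 1;
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + 1;
}

/* Fused: both statements execute while a[i] is still in cache
 * (or in a register). */
void init_and_increment_fused(int *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = 1;
        a[i] = a[i] + 1;
    }
}
```

Fusion is legal here because iteration i of the second loop depends only on iteration i of the first; in general the compiler must establish this through dependence analysis.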

3  Power and Performance Evaluation Platform

It is clear that, for the purpose of this study, we must
use a compiler/simulator platform which (1) is capable
of performing loop optimizations at both the high
level and the low level, with a smooth integration of
both; and (2) is capable of performing
micro-architecture level power simulations with a
quantitative power model.

To this end, we chose the Del-PACT (Delaware
Power-Aware Compilation Testbed), a fully integrated
framework composed of the SGI MIPSpro compiler
retargeted to the SimpleScalar [1] instruction set
architecture, and a micro-architecture level power
simulator based on an extension of the SimpleScalar
architecture simulator instrumented with the Cai/Lim
power model [5, 4], as shown in Figure 1. The SGI
MIPSpro compiler is an industry-strength, highly
optimizing compiler. It implements a broad range of
optimizations, including inter-procedural analysis and
optimization (IPA), loop nest optimization and
parallelization (LNO) [18], and SSA-based global
optimization (WOPT) [2, 11] at the high level. It also
has an efficient back end including software
pipelining, an integrated global and local scheduler
(IGLS) [12], and efficient global and local register
allocators (GRA and LRA) [3]. The legality of the
loop nest optimizations listed in Section 2 depends on
dependence analysis [20]. The SGI MIPSpro compiler
performs alias and dependence analysis and a rich set
of loop optimizations, including those we study in
this paper. We have ported the MIPSpro compiler to the
SimpleScalar instruction set architecture.


Figure 1: Power and Performance Evaluation Platform
(diagram: a source program is the input; performance
results and power results are the outputs)

The simulation engine of the Del-PACT platform
is driven by the Cai/Lim power model as shown in
the same diagram. The instrumented SimpleScalar
simulator generates performance results and activity
counters for each functional block. The physical
information comes from an approximation of circuit-level
power behavior. During each cycle, the parameterized
power model computes the present power consumption
of each functional unit using the following
formula:

power = AF * PDA * A + idle power + leakage power

where
    AF    activity factor
    PDA   active power density
    A     area

The power consumption of all functional blocks is
summed up to obtain the total power consumption.
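
The per-cycle computation can be sketched in a few lines. The struct layout and function names below are assumptions made for illustration; they are not the interface of the Cai/Lim model or of SimpleScalar, and units are left to the caller.

```c
#include <stddef.h>

/* Per-block parameters; field names are illustrative only. */
struct block_power {
    double activity_factor;   /* AF for the current cycle           */
    double power_density;     /* PDA, active power per unit area    */
    double area;              /* A                                  */
    double idle_power;
    double leakage_power;
};

/* power = AF * PDA * A + idle power + leakage power */
double block_power_now(const struct block_power *b) {
    return b->activity_factor * b->power_density * b->area
         + b->idle_power + b->leakage_power;
}

/* Total power for one cycle: the sum over all functional blocks. */
double total_power(const struct block_power *blocks, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += block_power_now(&blocks[i]);
    return total;
}
```

Energy over a run would then follow by accumulating total_power over all simulated cycles and multiplying by the cycle time.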

Other power/performance evaluation platforms
exist as well. One worth mentioning is SimplePower
[9], in which loop transformation techniques are
evaluated. In their framework, high-level and
low-level loop transformations are performed in two
isolated compilers, while in our platform the two are
tightly coupled in a single compiler. The differences
between these two power models are left to be studied;
related work is found in [6].

4  Experimental  Results 

In this section, we present the experiments we have
conducted using the Del-PACT platform. Two benchmark
programs, mxm and vpenta from the SPEC92
floating point benchmark suite, are used. We evaluated
the impact on performance/power of loop nest
optimizations, software pipelining and loop unrolling.
Loop nest optimization is a set of high-level
optimizations that includes loop fusion, loop
fission, loop peeling, loop tiling and loop
permutation. The MIPSpro compiler analyzes the
compiled program by determining the memory access
sequence of loops, and chooses those loop nest
optimizations which are legal and profitable. Looking
through the transformed code, we see that the loop
nest optimizations applied to mxm are loop permutation
and loop tiling, while those applied to vpenta are
loop permutation and loop fusion. Performance, power
and energy results of these transformations on each
benchmark are shown in Figure 2.

Figure 2: Performance, power and energy comparison

We observe that the performance improvement in
terms of execution time is positively correlated with
the energy reduction. Figure 2 shows that a variation
in execution time causes a similar variation in
energy consumption. The results show that, in the two
benchmarks we have run, the dominating factor of
energy consumption is the execution time.

Loop unrolling improves program execution time
by increasing instruction-level parallelism, which
increases power consumption correspondingly. For
the mxm example, the instructions per cycle (IPC)
increased from 1.68 to 1.8 when unrolling 4 times. For
the vpenta example, loop unrolling reduces the total
instruction count by 2% because of cross-iteration
common subexpression elimination. The IPC values
before and after loop unrolling are 1.01 and
1.04 respectively.

Software pipelining helps reduce total energy
consumption in the mxm example. The IPC values
without and with software pipelining are 1.68 and
1.9 respectively. Power consumption increases slightly
more than with loop unrolling because software
pipelining exploits more instruction-level parallelism
than loop unrolling does. However, energy consumption
is still reduced compared with the original
untransformed program. In the vpenta example, software
pipelining does not help because of the high miss
rate (13%) of level-1 data cache accesses.

Loop tiling and loop permutation applied to mxm
enhance cache locality and improve program
performance more than software pipelining does. Loop
permutation and loop fusion help the vpenta program
reduce its level-1 data cache miss rate from 13% to
10%, thus reducing total energy consumption. These
transformations also make the performance improvement
from software pipelining more evident than when
software pipelining is applied without these
high-level optimizations.

5  Conclusions 

In this paper, we introduced our Del-PACT platform,
an integrated framework that includes the MIPSpro
compiler, the SimpleScalar simulator and the
Cai/Lim power estimator. This platform can serve
as a tool for making architecture design tradeoffs
and for studying the impact of compiler optimization
on program performance and power consumption. We used
this platform to conduct experiments on the impact
of loop optimizations on program performance versus
power.

ABSTRACT

This  paper  describes  an  efficient  hierarchical  design  and  optimization 
approach  for  ultra-low  power  CMOS  logic  circuits.  We  introduce  the 
Hierarchical  Activity-Aware  Time  Slack  Distribution  (HA2TSD) 
algorithm,  which  distributes  the  surplus  time  slack  into  the  most 
power-hungry  modules  hierarchically.  HA2TSD  ensures  that  the  total 
slack  budget  is  maximal  and  the  total  power  is  near-minimal.  Based 
on  these  time  slacks,  we  have  optimized  technology  parameters 
(supply  voltage,  threshold  voltage,  and  device  width)  through  a  gate- 
level  power  optimizer  and  have  tested  the  algorithm  on  a  set  of 
benchmark  example  circuits  and  building  blocks  of  a  synthesizable 
ARM  core.  The  experimental  results  show  that  our  strategy  delivers 
over  an  order  of  magnitude  savings  in  total  (static  and  dynamic) 
power  and  reduces  the  optimization  run-time  significantly. 

Categories  and  Subject  Descriptors 

B.7.2  [Integrated  Circuits]:  Design  Aids-simulation. 

General  Terms 

Algorithms. 

Keywords 

Low-power  design,  time  slack  distribution,  and  gate-level  power 
optimization. 


1.  INTRODUCTION 

Recent advances in wireless networking technology and the rapid
development of semiconductor technology have introduced new
challenges in the design of portable devices such as personal digital
assistants (PDAs). Power optimization for embedded systems
and power-constrained mobile computing is an active area of research
that has received considerable attention in recent years. Delay,
area and power trade-offs for complex systems require the use of
advanced algorithms and EDA tools. To achieve excellent power and
performance results, future EDA tools must harness the combination
of technology parameters, i.e., multiple supply voltages (Vdd),
multiple threshold voltages (Vth), and transistor resizing (W). By
combining the optimization strategy with on-the-fly technology
parameter scaling, designers and EDA tools can fully explore the
design space of dynamic power, static power, and timing slack [1,2].

In general, low-power optimizations that do not compromise
performance are dependent on time slack calculation and the surplus
delay  (slack  budget)  distribution  among  the  circuit  modules.  Time 
slack  is  measured  as  the  difference  between  the  signal  required  time 
and  the  signal  arrival  time  at  the  primary  output  of  each  module.  The 
first  use  of  the  slack  distribution  approach  was  reported  by  the 
popular  zero-slack  algorithm  (ZSA)  [3].  The  ZSA  is  a  greedy 
algorithm  that  assigns  slack  budgets  to  nets  on  long  circuit  paths.  It 
ensures  that  the  net  slack  budget  is  maximal,  which  means  that  no 
more  slack  budget  can  be  assigned  to  any  of  the  nets  without 
violating  the  path  timing  constraints.  Most  other  slack  distribution 
algorithms  are  pruning  versions  of  ZSA  [4,5]  for  improving  delay 
performance  of  circuits.  However,  the  objective  of  the  timing 
analysis  in  this  paper  is  to  provide  a  low-power  methodology  that 
maintains  the  high  speed  of  circuits.  The  HA2TSD  algorithm  is 
different  from  the  ZSA  in  three  principal  aspects:  i)  time  slack 
distribution  of  each  module  is  based  on  power  rather  than 
performance  metrics;  ii)  the  slack  distribution  is  performed 
hierarchically,  and  iii)  the  technology  parameters  of  each  module  are 
optimized  at  the  gate  level. 
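
The slack definition used above reduces to a single subtraction. The helper names in this minimal sketch are hypothetical, and the times may be in any consistent unit:

```c
/* Time slack at a module's primary output: the signal required time
 * minus the signal arrival time. */
double time_slack(double required_time, double arrival_time) {
    return required_time - arrival_time;
}

/* Nonzero when the signal arrives no later than it is required,
 * i.e. when the slack is non-negative. */
int meets_timing(double required_time, double arrival_time) {
    return time_slack(required_time, arrival_time) >= 0.0;
}
```

A positive slack is the surplus delay that a slack distribution algorithm can hand to a module without violating the path timing constraint.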


Figure 1. Hierarchical Delay Assignment and Gate Level Power
Optimization (diagram labels: System level (ARM Core); Multiplier;
Hierarchical Delay Assignment for Ultra-low Power, objective: assign
more time to modules with more switching activity; Gate Level
Technology Parameter Optimization, objective: Vdd, Vth, W scaling to
minimize total power)


2.  DELAY  AND  ENERGY  MODEL 

We use a transregional model for estimating the worst-case signal
propagation delay through a gate. The delay model has been derived
using an extension of the alpha-power law saturation drain current
model [7] to the subthreshold region. The drain current model
incorporates the effects of high-field and quasi-ballistic (velocity
overshoot) carrier transport in the MOSFET channel. The delay
model considers four major components: 1) the delay due to
switching MOSFETs, 2) the distributed interconnect RC delay, 3) the
time-of-flight delay, and 4) the delay component due to the non-zero
rise time of the input signal. These definitions of gate
delay and interconnect resistance delay allow the definition of arrival
times and required times at the input and output of a gate in the
network, which are used for defining time slack.

In the above equation, $V_{dd}$ is the power supply voltage, $t_{dv}$ is the
delay of gate $G_v$, $V_{TSi}$ is the threshold voltage of the $i$th gate,
$\alpha$ is the velocity saturation coefficient ($1 < \alpha < 2$),
$t_{di,v}$ is the delay of the gate $G_v$ at the $i$th fan-in, $t_{dv,j}$ is
the delay of the gate $G_v$ at the $j$th fan-out, $I_{Dvw}(f_{ii})$ is the
switching drain current per unit width, $f_{ii}$ is the number of fan-ins,
$f_{oi}$ is the number of fan-outs, $\beta$ is the pMOS to nMOS width
ratio ($\beta > 1$), $I_{off}$ is the off current per unit width, $C_{DPv}$
is the sum of the overlap, junction and fringing capacitance at the output
node per unit width, $w_v$ is the device width (adjusting $w_v$ scales the
widths of all the transistors in $G_v$; $w_v > 1$), $w_{vj}$ is the device
width of the gate at the $j$th fan-out ($w_{vj} > 1$), $C_{tvj}$ is the
input capacitance per unit width of the gate being driven by the $j$th
fan-out, $C_{INTvj}$ is the interconnect capacitance at the $j$th fan-out,
$R_{INTvj}$ is the interconnection resistance at the $j$th fan-out,
$L_{INTvj}$ is the interconnection length at the $j$th fan-out, $v_{INT}$
is the propagation velocity through the interconnect, $C_{mv}$ is the
intermediate node capacitance of series-connected MOSFETs in
multiple fan-in gates, $f_c$ is the clock frequency, $\eta_v$ is the activity
factor of the gate output, and $K_{sc}$ is the coefficient for short-circuit
dissipation [8]. The models are described in detail in our previous work [6].

The  equations  used  to  compute  the  dynamic  and  static  energy 
dissipations  of  a  gate  are  described  next.  Similar  models  have  been 
presented  and  analyzed  in  a  recent  work  by  [8].  It  is  assumed  that  the 
gates  are  simple  multi-input  gates  with  symmetric  series  or  parallel 
pull-up  and  pull-down  MOSFET  configurations.  Contributions  of 
subthreshold  leakage  through  the  MOSFET  channel  as  well  as  the 
leakage  across  the  device  drain  junctions  to  static  dissipation  are 
included. 

1)  Static Dissipation of Gate $G_v$ ($v \in N$):

$$E_{Static,v} = V_{dd}\, w_v\, I_{off} / f_c \qquad (5)$$


2)  Dynamic  and  Short-Circuit  Dissipation  of  Gv 

E Dynamicv  ~  Rv^dd  ^Sc) 

3.  PREVIOUS WORK

Supply voltage scaling for low power has been investigated at almost
all levels of the design hierarchy, from system level to device level,
because of its quadratic effect on the switching power component.
Many related studies have appeared in the literature [1]. However,
it does not come without penalties [9]. The limitations of Vdd
reduction are: 1) delay increases (performance requirements impose a
limit); and 2) noise margins decrease (the circuit is more susceptible
to noise-related soft failures). The approaches to overcome the limits
of Vdd scaling are: 1) use of high-efficiency DC-DC converters [10];
2) scaling down the dimensions of devices along with Vdd to
compensate for the effects of Vdd on performance; and 3) reduction
of the threshold voltage of transistors.


Threshold voltage scaling can be used to compensate for the performance
penalty of Vdd reduction. In addition, for the active mode of
operation, a low Vth is preferred because of the higher performance.
However, for the standby mode, a high Vth is useful for reducing
leakage power. Different threshold voltages can be obtained by
multiple Vth implantation during fabrication, by changing the
substrate and source bias, or by controlling the back gate of
double-gate SOI (silicon on insulator) devices [10]. Some techniques
in the literature are: 1) SATS (self adjusting threshold voltage
scheme) [11]; 2) MTCMOS (multi-threshold voltage CMOS) [12];
3) DTMOS (dynamic threshold voltage MOSFET) [13]; and 4) DGDT-SOI
(double gate dynamic threshold control SOI) [14]. In general, the
threshold voltage is a function of a number of parameters including
the following: 1) Gate conductor, 2) Gate insulation material, 3) Gate
insulator thickness-channel doping, 4) Impurities at the silicon-insulator
interface, and 5) Voltage between the source and the
substrate.

Transistor and gate sizing affects dynamic power, leakage power,
and delay. A large gate is required to drive a large load
capacitance with acceptable delay but requires more power. The basic
rule is to use the smallest transistors or gates that satisfy the delay
constraints. To reduce dynamic power, the gates that toggle with
higher frequency should be made smaller. An interesting problem
occurs when the sizing goal is to reduce the leakage power of a circuit.
The leakage current of a transistor increases with decreasing threshold
voltage and channel length. In general, a lower threshold or shorter
channel transistor can provide more saturation current and is thus
a faster transistor. This presents a tradeoff between leakage power
and delay. A number of optimization algorithms for gate sizing have
been proposed over several decades [15].

Figure 2 presents the fundamental characteristics of the three
device parameters (Vdd, Vth, W) for power and delay tradeoffs [2].
Figure 2(a) shows the Vdd/Vth and Delay*Energy tradeoffs. It shows
that the supply voltage should be larger than four times the
threshold voltage if the delay is not to increase excessively. Figure
2(b) shows the Device Width and Delay*Energy tradeoffs. It
shows that the delay decreases with increasing device width but the
delay-energy product is minimized when the devices contribute half
of the total load capacitance. The technology parameter trade-offs
are summarized in Figure 2(c). In this paper, we try to optimize the
non-linear parameters of those tradeoffs efficiently to minimize the
total power.


Figure 2. Technology Optimization Rationale

(a) Vdd, Vth, Energy, Delay Tradeoffs [2]
(b) Device Width, Energy, Delay Tradeoffs [2]
(c) Technology Scaling Rationale:

    Vdd up:    delay down, power up (quadratic)
    Vdd down:  delay up,   power down (quadratic)
    Vth up:    delay up,   power down
    Vth down:  delay down, power up
    W up:      delay down, power up
    W down:    delay up,   power down

To minimize total power:
i) assign more surplus delay to the modules/gates that consume more power
ii) scale Vdd, Vth and W simultaneously

4.  PROPOSED APPROACH

The key steps of our approach are shown in Figure 3. First,
hierarchical circuit partitioning is performed. Then, beginning with
the topmost level of the design hierarchy, delay values are assigned to
every module at that level. The total delay from primary input (PI) to
primary output (PO) is given. The problem is to determine the delays of
the individual modules so that total power consumption can be
minimized by optimizing the supply voltage, threshold voltage and
device sizes of each module Mj for the assigned delay values. The
procedure is repeated hierarchically. We use the following heuristic
to assign delays to each module.

Heuristic: In a given data flow graph of modules $M_j$, let
$C_j = \sum_{\text{node } i} \eta_i C_i$ be the summation of the product of
the activity $\eta_i$ at node $i$ and the capacitance $C_i$ at node $i$ over
all nodes $i$ of the module $M_j$. If the delay assigned to module $M_j$ is
$D_j$, then the best delay assignment for minimizing power is obtained when

$$\frac{D_1}{C_1} = \frac{D_2}{C_2} = \cdots = \frac{D_j}{C_j}$$

(Figure 4 panel text, Mapping into Graph Theory: i) generate the
topological level (depth) for each module by using a labeling algorithm;
ii) define the max depth and max number of nodes inside each module;
iii) generate a partition of each module so as to minimize skew for the
given depth and size.)


It  is  clear  that  such  an  assignment  of  delay  to  each  Mj  can  cause  the 
overall  path  delay  constraint  (sum  of  delays  assigned  to  each  module) 
to  be  violated  for  some  of  the  paths  in  the  module.  Therefore,  the 
iterative  HA2TSD  algorithm  is  used  to  solve  the  problem.  This  is 
described  below. 
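
The proportional assignment in the heuristic can be sketched directly. The code below assumes the simplest case, in which all modules lie in series on a single path so that their delays must add up to the given total; it is not the iterative HA2TSD algorithm, and the function names are hypothetical.

```c
#include <stddef.h>

/* C_j for one module: the sum over its nodes of activity * capacitance. */
double module_weight(const double *activity, const double *cap, size_t nodes) {
    double c = 0.0;
    for (size_t i = 0; i < nodes; i++)
        c += activity[i] * cap[i];
    return c;
}

/* Split total_delay so that D_j / C_j is the same for every module,
 * assuming the modules are in series (sum of D_j == total_delay).
 * Returns 0 on success, -1 if the weights do not sum to a positive value. */
int assign_delays(const double *weight, size_t modules,
                  double total_delay, double *delay) {
    double sum = 0.0;
    for (size_t j = 0; j < modules; j++)
        sum += weight[j];
    if (sum <= 0.0)
        return -1;
    for (size_t j = 0; j < modules; j++)
        delay[j] = total_delay * weight[j] / sum;
    return 0;
}
```

Modules with a larger activity-weighted capacitance receive a larger share of the delay budget, which is the intent of giving more surplus time to the most power-hungry modules. With several paths sharing modules, a single proportional split can violate a path constraint, which is why an iterative procedure is needed.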


Figure 3. Power Optimization Procedure (diagram steps: 1. Hierarchical
circuit partitioning; 2. Iterative hierarchical time slack distribution
for power: at each level of the design hierarchy, the total execution
time budget is allocated to each module, and the execution times are
solved in a top-down manner; 3. Gate-level technology optimization
for low power)

4.1  Topological  Depth-Based  Partitioning 

For simulation run-time efficiency and power optimization
effectiveness, we introduce a circuit partitioning algorithm which
minimizes the delay skew between sub-modules
and constrains the maximum sub-module size (or fan-out size). Figure 4
gives a conceptual overview of the topological depth-based partitioning.
First, each circuit node is labeled according to
the topological order. Then, according to the maximum depth and
maximum size constraints, the whole flattened gate-level digital
circuit is partitioned into sub-module circuits. The detailed algorithm
for the partitioning is shown in Figure 5. The complexity of this
algorithm is O(b^m), where b is the branching factor (i.e., average
fan-out number) and m is the maximum topological depth.


Figure  4.  Partitioning  Overview 


HA2TSD Algorithm 1: Partitioning

Input:  Directed acyclic graph G = (V, E)
Output: Partitioned sub-graphs G_i = (V_i, E_i), i = 2 ... n

Begin

  Phase 0: Initialization
    Initialize the class variable of all nodes V in G;
    (level (topological order) = ∞)

  Phase I: Labeling {Identify topological order}
    Put a start node into FIFO queue q;
    Initialize the level of the start node = 0;
    While (q is not empty)
    {
      Obtain a reference to the first element (x) in q;
      Remove node x from q;
      For each fan-out node y of node x
      {
        If level of y = ∞ then
          If number of fan-in nodes of y = 1 then
            level of y = level of x + 1;
          else if number of fan-in nodes of y > 1 then
            level of y = MAX(levels of fan-in nodes of y) + 1;
          Add node y into q;
      }
    }

  Phase II: Partitioning
    Sort all nodes according to the topological order;
    Partition the graph G by constraints (max node number & depth);

End


Figure  5.  Partitioning  Algorithm 
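
The two phases of Figure 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The labeling step below uses a fan-in counter (Kahn-style traversal) so that a node is labeled only after all of its fan-in nodes have final levels, which is a substitution for the queue logic in the figure. The grouping rule in `partition` (close a sub-module when it reaches `max_nodes` or spans `max_depth` levels) is an assumption, since the figure only states that the graph is partitioned "by constraints". The names `label_levels`, `partition`, `max_nodes`, and `max_depth` are hypothetical.

```python
from collections import deque


def label_levels(fanout):
    """Return {node: level}; level is the longest distance from any source node.

    `fanout` maps every node (including sinks) to the list of its fan-out nodes.
    """
    indegree = {n: 0 for n in fanout}
    for successors in fanout.values():
        for y in successors:
            indegree[y] += 1

    level = {n: 0 for n in fanout}
    queue = deque(n for n, d in indegree.items() if d == 0)
    while queue:
        x = queue.popleft()
        for y in fanout[x]:
            # MAX over fan-in levels, plus one, as in Phase I of Figure 5.
            level[y] = max(level[y], level[x] + 1)
            indegree[y] -= 1
            if indegree[y] == 0:  # all fan-ins labeled, so y's level is final
                queue.append(y)
    return level


def partition(fanout, max_nodes, max_depth):
    """Group nodes, sorted by level, into sub-modules (assumed grouping rule)."""
    level = label_levels(fanout)
    ordered = sorted(fanout, key=lambda n: (level[n], n))
    parts, current = [], []
    for n in ordered:
        too_big = len(current) >= max_nodes
        too_deep = bool(current) and level[n] - level[current[0]] >= max_depth
        if too_big or too_deep:
            parts.append(current)
            current = []
        current.append(n)
    if current:
        parts.append(current)
    return parts


# Toy netlist (hypothetical): two inputs, two middle gates, one output.
toy = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
print(label_levels(toy))
print(partition(toy, max_nodes=2, max_depth=2))
```

Grouping by level keeps the depth span of each sub-module bounded, which is one way to limit the delay skew between sub-modules that the text asks for.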

4.2  Activity-Aware  Delay  Assignment 

Figure 6 presents an example of the module-level delay assignment
algorithm. First, the modules are sorted by load capacitance (step 1).
In priority order, each module is assigned its maximum delay using the
“objective function” and “delay assignment” formulas in Figure 6
(steps 2 and 3). A local search then looks for local improvements
(step 4). Once all module delays are assigned, the technology
parameter optimization is conducted at the gate level (step 5).
Finally, we generate the power/area saving values and optimal
parameters. In the algorithm, each module (M1, ..., Mi) can be a
functional module or a sub-partition, the total physical capacitance
of a module can be the sum of the fan-in/out counts inside the module,
and the load capacitance of each module can be calculated by
multiplying the total switching activities by the total fan-in/out
net counts. The algorithm is shown in Figure 7. Its complexity is
O(nbm), where n is the number of modules, b is the branching factor
(i.e., average fan-out number), and m is the maximum topological depth.


* Original Module Delay = Balanced similarly
* Each Module Load Capacitance
    = Total Switching Activity * Total Capacitance

Cycle Time (Tmax) = 30 ns

Objective: Assign max delay of each module for max power saving
(Note: slack time = power/area saving)

Object Function = [expression illegible in source]

    c_i = Σ net  (summed at node i)
    (α_i = switching activity at node i, c_i = capacitance at node i)

                         Module Load Capacitance
Delay Assignment = ---------------------------------- x (Tmax - Assigned Delay Sum)
                   Total Load Capacitance Sum in Path

Step 1: Module priority queue for each module by load capacitance
        M4, M1 (20)
        M2 (15)
        M3 (10)
        M5, M6 (5)

Step 2: - Select M4
        - Path priority queue for each path with M4
            Path1: M1, M2, M4, M6 (60)
            Path2: M1, M3, M4, M6 (55)
        - Select Path1
        - Delay of M4 = (20/60)*30 = 10 ns

Step 3: - Repeat Step 2 for all modules
        - Delay of M1 = (20/40)*20 = 10 ns
        - Delay of M2 = (15/20)*10 = 7.5 ns
        - Delay of M3 = (10/15)*10 = 6.66 ns
        - Delay of M5 = 10 ns
        - Delay of M6 = 2.5 ns

Step 4: - Local search improvement
        - Increase 6.66 to 7.5 for M3

Step 5: - Go to gate-level optimization (Vdd, Vth, W scaling)
          with this max delay of each module

Figure  6.  An  Example  of  Delay  Assignment 
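
The arithmetic in Figure 6 can be checked with a short Python sketch. It reads the figure's numbers as follows, which is an interpretation rather than something the figure states outright: the denominator of the delay-assignment ratio sums the load capacitance of the modules on the selected path that have no delay yet, and the remaining time is Tmax minus the delays already assigned on that path. M5 is left out because the figure does not list a path containing it. The function names `assign_delays` and `local_search` are hypothetical.

```python
def assign_delays(load_cap, paths, t_max):
    """Assign each module a maximum delay in proportion to its load capacitance."""
    delay = {}
    # Step 1: visit modules in decreasing order of load capacitance.
    for m in sorted(load_cap, key=lambda k: -load_cap[k]):
        # Step 2: among the paths through m, select the one with the largest total load.
        path = max((p for p in paths if m in p),
                   key=lambda p: sum(load_cap[x] for x in p))
        unassigned = sum(load_cap[x] for x in path if x not in delay)
        remaining = t_max - sum(delay[x] for x in path if x in delay)
        delay[m] = load_cap[m] * remaining / unassigned
    return delay


def local_search(delay, paths, t_max):
    """Step 4: give each module the slack left over on every path it belongs to."""
    improved = dict(delay)
    for m in improved:
        slack = min(t_max - sum(improved[x] for x in p) for p in paths if m in p)
        improved[m] += max(0.0, slack)
    return improved


# Values taken from Figure 6 (M5 omitted, see above).
load_cap = {"M4": 20, "M1": 20, "M2": 15, "M3": 10, "M6": 5}
paths = [["M1", "M2", "M4", "M6"], ["M1", "M3", "M4", "M6"]]

first_pass = assign_delays(load_cap, paths, t_max=30.0)
final = local_search(first_pass, paths, t_max=30.0)
print(first_pass)
print(final)
```

Under these assumptions the first pass gives M4 and M1 10 ns each, M2 7.5 ns, M3 about 6.67 ns, and M6 2.5 ns, and the local search raises M3 to 7.5 ns because Path2 still has slack while Path1 does not.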


HA2TSD Algorithm 2: Delay Assignment

Input:  Partitioned sub-graphs G_i (V_i, E_i)
Output: Delay-weighted sub-graphs G_i (V_i, E_i, W(V_i))

Begin
  Phase 0: Initialization
    Enumerate the critical paths P_j in G = {G_1 ... G_i};
    Sort P_j in decreasing order of criticality;

  Phase I: Delay assignment for each path
    Identify maximum delay Tmax of all paths;
    Calculate switching activity α_i for all nodes V_i;
    Set the delay for nodes V_i on critical path(s);
    For each path P_j
    {
      While (unassigned V_i ≠ ∅)
      {
        T_max(V_i) = [α_i / (α_1 + ... + α_i + ... + α_n)] * Tmax;
        /* where n = number of nodes on the path P_j */
        W(V_i) = T_max(V_i);
      }
    }
End


Figure  7.  Delay  Assignment  Algorithm 

4.3  Gate-level  Power  Optimization 

There are three ways to save power while maintaining operating
frequency by utilizing surplus time slack in non-critical paths:
i) employing multiple Vdd to lower the supply voltage, ii) employing
multiple Vth to reduce leakage current, and iii) employing multiple W
to reduce circuit capacitance. In this paper, Vdd reduction is the
main scaling parameter for low power, and Vth and W scaling serve
mainly to create more time slack for the ultra-low power
optimization. The difficulties of power optimization at the gate
level come from two major aspects: i) the non-linear interactions of
the objective parameters (for example, each gate has at least four
non-linear variables: Vdd, Vth, W, and delay) and ii) the
optimization time complexity (for example, after logic synthesis of
the target system, each functional module, such as an ALU, adder, or
multiplier, might generate a large number of gates and
interconnections, and the search space for the optimization is
exponential). Therefore, a simulation-efficient partitioning scheme
should be judiciously considered before the gate-level optimization.
Figure 8 shows the relationship between the maximum delay assignment
and the technology scaling for power savings.


After the maximum delays have been assigned to each module/gate in
the circuit, we optimize each gate individually for minimum power.
The strategy is to find iteratively, using binary search, the optimal
combination of Vdd, Vth, and W for each gate that meets the maximum
delay condition while achieving minimum power dissipation. We used
our previous work for the gate-level power optimization [6]. This
strategy is based on the observation that power consumption and delay
are each monotonic functions of Vdd, Vth, and W individually, with
the other parameters held fixed. Since it is impractical to have more
than one power supply or threshold voltage in the circuit, we keep
only one global value of Vdd and Vth. However, the algorithm could
easily be modified to allow multiple threshold values in the circuit
if desired. The algorithmic complexity of this procedure depends on
the number of iteration steps that we allow for convergence to the
optimal values. Assuming that Vdd, Vth, and W are each constrained to
2^M quantized values, it takes O(M^3) simulations of the entire
circuit to obtain the final optimal values.
This  is  many  orders  of  magnitude  lower  than  the  complexity  of  any 
direct  or  random  search  algorithm  that  may  be  used  to  search  for  the 
optimal  solution. 
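
To illustrate why monotonicity makes binary search applicable, here is a minimal Python sketch for a single parameter (Vdd) with Vth and W held fixed. The delay model is the widely used alpha-power-law form, chosen here only as a stand-in for circuit simulation; the constants `k` and `alpha`, the 16-level Vdd grid, and the function names are all hypothetical and are not taken from [6].

```python
def gate_delay(vdd, vth, k=1.0, alpha=1.3):
    """Alpha-power-law delay model (assumed stand-in for a circuit simulator).

    For vdd > vth and alpha >= 1 the delay decreases monotonically as vdd rises.
    """
    return k * vdd / (vdd - vth) ** alpha


def lowest_vdd(levels, vth, max_delay):
    """Binary-search sorted Vdd levels for the lowest one that meets max_delay.

    Returns (vdd, steps), or None if even the highest level is too slow.
    """
    lo, hi = 0, len(levels) - 1
    if gate_delay(levels[hi], vth) > max_delay:
        return None
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if gate_delay(levels[mid], vth) <= max_delay:
            hi = mid        # constraint met: try a lower supply voltage
        else:
            lo = mid + 1    # too slow: a higher supply voltage is needed
    return levels[lo], steps


M = 4                                                  # 2^M = 16 quantized levels
levels = [0.6 + 0.04 * i for i in range(2 ** M)]       # 0.6 V to 1.2 V
vdd, steps = lowest_vdd(levels, vth=0.2, max_delay=1.5)
print(vdd, steps)
```

With 2^M levels the loop takes M delay evaluations. Nesting the same search over three parameters is consistent with the O(M^3) count quoted above, although the exact nesting used in the original work is not specified here.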

5.  RESULTS 

We developed a simulation framework in C/C++/STL and Perl on an
Ultra-80 Unix machine for the hierarchical power optimization. We
also used off-the-shelf commercial tools for the RTL description, the
functional verification, and the logic synthesis of the target
system. A few arithmetic modules from the target system and
ISCAS89/MCNC91 benchmark circuits are used for the experimental
demonstration. For the range of the technology parameter values, the
2001 updated version of the ITRS (International Technology Roadmap
for Semiconductors) and the MOSIS (integrated circuit fabrication
service) parameter test results for TSMC 0.25 micron are used. For
the RTL design we used Verilog hardware description; for the
functional simulation we used VCS (Synopsys); and for the logic
synthesis we used Design Analyzer (Synopsys) with the 0.25 micron
TSMC library.

Monte  Carlo  simulation  is  performed  for  activity  profiling  of  each 
module/sub-module  as  described  in  [2].  This  approach  consists  of 
applying  randomly  generated  input  patterns  at  the  primary  inputs  of 
the  circuit  and  monitoring  the  switching  activity  per  time  interval  T 
using  a  simulator.  Under  the  assumption  that  the  switching  activity  of 
a  circuit  module  over  any  period  T  has  a  normal  distribution,  and  for 
a  desired  percentage  error  in  the  activity  estimate  and  a  given 
confidence  level,  the  number  of  required  simulation  vectors  is 
estimated. The simulation-based approach is accurate and capable of
handling  various  device  models,  different  circuit  design  styles,  single 
and  multi-phase  clocking  methodologies,  tristate  drives,  etc. 
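
The stopping rule sketched in this paragraph can be written down under the stated normality assumption. The Python sketch below uses the standard normal-approximation bound N >= (z * s / (e * m))^2, where m and s are the sample mean and standard deviation of the per-interval activity, e is the desired relative error, and z is the two-sided normal quantile for the chosen confidence level. Whether [2] uses exactly this bound is an assumption, and the sample values and the name `required_vectors` are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def required_vectors(samples, rel_error, confidence):
    """Estimate how many activity samples are needed for a given accuracy.

    samples:    switching activity observed in each interval T of a pilot run
    rel_error:  desired relative error of the mean estimate (e.g. 0.05)
    confidence: desired confidence level (e.g. 0.95)
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile
    m, s = mean(samples), stdev(samples)
    return math.ceil((z * s / (rel_error * m)) ** 2)


# Hypothetical pilot run: toggles counted in eight intervals.
pilot = [10, 12, 9, 11, 10, 13, 8, 11]
print(required_vectors(pilot, rel_error=0.05, confidence=0.95))
print(required_vectors(pilot, rel_error=0.10, confidence=0.95))
```

Halving the allowed error roughly quadruples the number of intervals that must be simulated, which is why the error target has a large effect on profiling run time.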

Figure 9 shows the hierarchy and the granularity that we used in our
simulation. In this paper, we simulated only the 3-level hierarchical
case. Table 1(a) shows the total power consumption with fixed
technology parameters for the given circuits. Table 1(b) demonstrates
the efficiency and effectiveness of the hierarchical power
optimization with the proposed design flow. The experimental results
show that our power optimization strategy delivers an
order-of-magnitude savings in total (static and dynamic) power
without performance degradation over non-optimized benchmark
circuits, and that our hierarchical approach is much faster than the
traditional approach. With a hierarchical depth of 3, as shown in
Figure 9, optimization is on average about 6 times faster than in the
totally flattened case, while average power savings remain at 83.6%.


Figure  9.  Hierarchy  in  our  Simulation 


6.  CONCLUSION 

This paper presents an efficient hierarchical low-power design flow
and a novel switching-activity-based optimization algorithm for
ultra-low power CMOS VLSI. Experimental results show that the
algorithm typically reduces power by a factor of 19.6x to 52.4x with
optimal Vdd/Vth and multiple W scaling. In summary, the key
contributions of the new power minimization technique are: i) the
total (static and dynamic) power is minimized significantly without
compromising speed; ii) with the hierarchical approach,
polynomial-time optimization is feasible in very large circuits; and
iii) the activity-aware delay assignment ensures that the total time
slack is maximum and the total power is near-minimal. Future work
will include application-specific and architecture-driven issues with
these technology scaling techniques.


Table 1. Results of H2TSD-Based Power Optimization


(a) Before Optimization (Fixed Vdd: 3.3 V, Vth: 0.7 V)

System        Gates/    Delay  Input                  Power Dissipation
Module        Depth     (ns)   Activity  tn, <7co     Leakage   Switching Short-ckt Total
------------  --------  -----  --------  -----------  --------  --------  --------  --------
4-Full Adder  106/48    3.36   0.5       17.9, 24.5   2.09E-20  4.37E-11  2.15E-12  4.59E-11
                               0.05      17.9, 24.5   2.09E-20  4.33E-12  2.13E-13  4.54E-12
16-Lookahead  1838/81   7.0    0.5       5.9, 6.2     1.48E-19  7.65E-10  9.33E-11  8.58E-10
                               0.05      5.9, 6.2     1.48E-19  1.39E-10  9.29E-12  1.48E-10
64-ALU        3417/226  18.6   0.5       6.1, 9.2     1.12E-18  4.4E-09   2.87E-10  4.69E-09
                               0.05      6.1, 9.2     1.12E-18  1.90E-10  2.87E-12  1.93E-10
s298          286/18    3.02   0.5       4.8, 7.7     1.92E-20  1.44E-11  2.37E-13  1.46E-11
                               0.05      4.8, 7.7     1.92E-20  1.39E-13  2.55E-15  1.42E-13
s344          229/28    3.86   0.5       15.9, 26.2   4.59E-20  6.38E-11  9.87E-13  6.48E-11
                               0.05      15.9, 26.2   4.59E-20  6.39E-13  9.62E-15  6.49E-13
s386          426/23    3.99   0.5       10.9, 9.2    5.56E-20  4.88E-11  9.99E-13  4.98E-11
                               0.05      10.9, 9.2    5.56E-20  5.13E-13  9.62E-15  5.23E-13
s526          596/18    4.3    0.5       5.2, 7.8     5.88E-20  5.13E-11  2.00E-12  5.33E-11
                               0.05      5.1, 7.8     5.88E-20  5.32E-13  9.82E-15  5.41E-13
C6288         2406/129  10.6   0.5       4.7, 8.2     6.52E-18  3.21E-09  6.55E-10  3.87E-09
                               0.05      4.3, 8.1     6.52E-18  4.39E-10  6.54E-12  4.76E-10



(b) After Optimization (Vdd: 0.6-1.2 V, Vth: 0.1-0.52 V)

Granularity is listed per hierarchy level (L1, L2, L3 = Level 1, 2, 3).

System        Hier-                                 Input                                 Total        Savings
Module        archy  Granularity                    Activity  Vdd, Vth     tn, <7co     Power        Power    Run-Time
------------  -----  -----------------------------  --------  -----------  -----------  -----------  -------  --------
4-Full Adder  0      Each gate                      0.5       0.6, 0.1     6.61, 8.14   1.95E-11     57.5%    0x
                                                    0.05      0.7, 0.1     5.62, 7.19   3.14E-13     31.0%    0x
              2      L1: 53, L2: 17.7               0.5       0.625, 0.1   6.62, 3.14   2.95E-11     35.7%    4.28x
                                                    0.05      0.725, 0.2   6.99, 5.64   3.57E-13     21.5%    4.28x
              3      L1: 53, L2: 17.7, L3: 2.9      0.5       0.7, 0.12    7.3, 6.22    3.19E-11     30.4%    18.8x
                                                    0.05      0.725, 0.2   8.6, 9.0     3.67E-13     19.4%    18.8x
16-Lookahead  0      Each gate                      0.5       0.8, 0.1     3.1, 2.14    2.93E-11     96.6%    0x
                                                    0.05      0.825, 0.1   3.66, 2.94   8.03E-13     90.7%    0x
              2      L1: 919, L2: 306.3             0.5       0.8, 0.1     4.19, 1.14   3.09E-11     96.4%    2.26x
                                                    0.05      0.825, 0.1   4.45, 7.10   8.66E-13     90.0%    2.26x
              3      L1: 919, L2: 306.3, L3: 51.1   0.5       0.8, 0.1     5.01, 4.14   6.40E-11     92.5%    3.08x
                                                    0.05      0.85, 0.12   4.91, 6.16   1.02E-12     88.2%    3.08x
64-ALU        0      Each gate                      0.5       0.9, 0.1     5.71, 3.13   5.26E-11     98.9%    0x
                                                    0.05      0.925, 0.1   5.91, 5.10   2.34E-12     98.8%    0x
              2      L1: 1708, L2: 569.5            0.5       0.925, 0.1   3.63, 3.13   5.50E-11     98.8%    2.10x
                                                    0.05      0.95, 0.12   4.62, 5.12   9.60E-12     95.0%    2.10x
              3      L1: 1708, L2: 569.5, L3: 94.9  0.5       0.95, 0.12   3.51, 8.15   8.09E-11     98.3%    2.10x
                                                    0.05      0.925, 0.2   5.81, 6.14   2.30E-11     88.1%    2.70x
s298          0      Each gate                      0.5       0.6, 0.1     2.62, 4.44   2.52E-13     98.3%    0x
                                                    0.05      0.625, 0.1   3.21, 7.14   4.46E-15     96.8%    0x
              2      L1: 143, L2: 47.7              0.5       0.625, 0.1   3.61, 3.14   5.51E-13     96.2%    3.13x
                                                    0.05      0.625, 0.1   3.31, 4.19   1.09E-14     92.3%    3.13x
              3      L1: 143, L2: 47.7, L3: 7.94    0.5       0.625, 0.1   4.11, 4.14   8.55E-13     94.2%    6.48x
                                                    0.05      0.65, 0.12   4.31, 2.94   1.45E-14     89.8%    6.48x
s344          0      Each gate                      0.5       0.7, 0.1     8.61, 9.34   6.44E-13     99.0%    0x
                                                    0.05      0.725, 0.2   9.21, 3.14   8.31E-14     87.2%    0x
              2      L1: 115, L2: 38.17             0.5       0.8, 0.12    12.1, 5.14   2.03E-12     96.9%    3.32x
                                                    0.05      0.8, 0.12    9.61, 2.14   9.36E-14     85.6%    3.32x
              3      L1: 115, L2: 38.17, L3: 6.36   0.5       0.85, 0.1    7.61, 3.15   6.02E-12     90.7%    7.62x
                                                    0.05      0.825, 0.2   9.61, 2.14   1.35E-13     79.2%    7.62x
s386          0      Each gate                      0.5       0.6, 0.1     4.61, 5.14   4.43E-13     99.1%    0x
                                                    0.05      0.6, 0.1     5.88, 3.74   1.82E-14     96.5%    0x
              2      L1: 213, L2: 71                0.5       0.6, 0.1     7.61, 9.10   4.63E-13     99.1%    2.86x
                                                    0.05      0.625, 0.1   7.61, 4.14   1.92E-14     96.3%    2.86x
              3      L1: 213, L2: 71, L3: 11.8      0.5       0.625, 0.1   9.33, 4.14   9.58E-13     98.1%    5.13x
                                                    0.05      0.65, 0.12   10.01, 9.1   2.16E-14     95.9%    5.13x
s526          0      Each gate                      0.5       0.6, 0.1     3.61, 3.34   6.91E-13     98.7%    0x
                                                    0.05      0.625, 0.1   4.55, 5.15   1.84E-14     96.6%    0x
              2      L1: 298, L2: 99.3              0.5       0.625, 0.1   4.21, 2.14   1.00E-12     98.1%    2.68x
                                                    0.05      0.625, 0.1   4.21, 5.94   2.46E-14     95.4%    2.68x
              3      L1: 298, L2: 99.3, L3: 16.6    0.5       0.625, 0.1   5.61, 6.14   1.93E-12     97.4%    4.39x
                                                    0.05      0.65, 0.12   4.91, 7.14   3.62E-14     93.3%    4.39x
C6288         0      Each gate                      0.5       0.925, 0.1   5.61, 3.14   3.49E-11     99.1%    0x
                                                    0.05      0.9, 0.12    4.67, 4.44   [illegible]  98.5%    0x
              2      L1: 1203, L2: 401              0.5       0.925, 0.1   3.91, 5.14   [illegible]  98.6%    2.19x
                                                    0.05      0.825, 0.1   5.69, 4.14   7.97E-12     98.3%    2.19x
              3      L1: 1203, L2: 401, L3: 66.8    0.5       0.95, 0.2    4.61, 6.99   8.81E-11     97.9%    2.90x
                                                    0.05      0.925, 0.2   5.71, 5.54   7.03E-11     85.2%    2.90x
Median        0      Each gate                      0.5                                              98.8%    0x
                                                    0.05                                             96.6%    0x
              2      L1: 511, L2: 85.1              0.5                                              97.5%    2.77x
                                                    0.05                                             93.7%    2.77x
              3      L1: 511, L2: 85.1, L3: 14.2    0.5                                              95.8%    4.76x
                                                    0.05                                             88.1%    4.76x
Average       0                                                                                      90.2%    0x
              2                                                                                      87.1%    2.86x
              3                                                                                      83.6%    6.39x